Abstract: The $\ell_1$-penalized method, or the Lasso, has emerged as an important tool
for the analysis of large data sets. Many important results have been obtained for
the Lasso in linear regression which have led to a deeper understanding of
high-dimensional statistical problems. In this article, we consider a class of weighted
$\ell_1$-penalized estimators for convex loss functions of a general form, including the
generalized linear models. We study the estimation, prediction, selection and sparsity
properties of the weighted $\ell_1$-penalized estimator in sparse, high-dimensional settings
where the number of predictors $p$ can be much larger than the sample size $n$. Adaptive
Lasso is considered as a special case. A multistage method is developed to apply an
adaptive Lasso recursively. We provide $\ell_q$ oracle inequalities, a general selection
consistency theorem, and an upper bound on the dimension of the Lasso estimator.
Important models including the linear regression, logistic regression and log-linear
models are used throughout to illustrate the applications of the general results.

1 Introduction

High-dimensional data arise in many diverse fields of scientific research. For example, 
in genetic and genomic studies, more and more large data sets are being generated 
with rapid advances in biotechnology, where the total number of variables p is 
larger than the sample size n. Fortunately, statistical analysis is still possible for a
substantial subset of such problems with a sparse underlying model where the number
of important variables is much smaller than the sample size. A fundamental problem
in the analysis of such data is to find reasonably accurate sparse solutions that are
easy to interpret and can be used for the prediction and estimation of covariable
effects. The $\ell_1$-penalized method, or the Lasso [Tib96, CDS98], has emerged as an
important approach to finding such solutions in sparse, high-dimensional statistical
problems.

In the last few years, considerable progress has been made in understanding
the theoretical properties of the Lasso in $p \gg n$ settings. Most results have been
obtained for linear regression models with a quadratic loss. [GR04] studied the
prediction performance of the Lasso in high-dimensional least squares regression.
[MB06] showed that, for neighborhood selection in the Gaussian graphical models,
under a neighborhood stability condition on the design matrix and certain additional
regularity conditions, the Lasso is selection consistent even when $p \to \infty$ at a rate
faster than $n$. [ZY06] formalized the neighborhood stability condition in the context
of linear regression as a strong irrepresentable condition. [CT07] derived an upper
bound for the $\ell_2$ loss for the estimation of regression coefficients with a closely
related Dantzig selector under a condition on the number of nonzero coefficients
and a uniform uncertainty principle on the design matrix. Similar results have been
obtained for the Lasso. For example, upper bounds for the $\ell_q$ loss of the Lasso
estimator have been established by [BTW07] for $q = 1$, [ZH08] for $q \in [1, 2]$, [MY09]
for $q = 2$, [BRT09] for $q \in [1, 2]$, and [Zha09, YZ10] for general $q \ge 1$. For
convex minimization methods beyond linear regression, [vdG08] studied the Lasso
in high-dimensional generalized linear models (GLM) and obtained prediction and $\ell_1$
estimation error bounds. [NRWY10] studied penalized M-estimators with a general
class of regularizers, including an $\ell_2$ error bound for the Lasso in GLM under a
restricted convexity and other regularity conditions.

Theoretical studies of the Lasso have revealed that it may not perform well for
the purpose of variable selection, since its required irrepresentable condition is not
properly scaled in the number of relevant variables. In a number of simulation
studies, the Lasso has shown weakness in variable selection when the number of
nonzero regression coefficients increases. As a remedy, a number of proposals have
been introduced in the literature, including concave penalized LSE [FL01, Zha10a],
adaptive Lasso [Zou06], and stepwise regression [Zha11]. Although extensions of the
concave penalized LSE are beyond the scope of this paper, adaptive Lasso is studied
here as a weighted Lasso with estimated weights. When the number of predictors $p$
is fixed, [Zou06] proved that the adaptive Lasso has the asymptotic oracle property
in linear regression models. [HMZ08] showed that the oracle property continues to
hold for the adaptive Lasso in $p \gg n$ settings under an adaptive irrepresentable and
other regularity conditions. [MB07] suggested using the Lasso as the initial estimator
for the adaptive Lasso or even a multi-step adaptive Lasso. The one-step method of
[ZL08], designed to approximate penalized estimators with concave penalties, can
also be viewed as an adaptive Lasso.

In this article, we consider a class of weighted $\ell_1$-penalized estimators with a
convex loss function. This class includes the Lasso, adaptive Lasso and multistage
recursive application of an adaptive Lasso in generalized linear models as special
cases. We study the estimation, prediction, selection and sparsity properties of the
weighted $\ell_1$-penalized estimator based on a convex loss in sparse, high-dimensional
settings where the number of predictors $p$ can be much larger than the sample size
$n$. The main contributions of this work are as follows.

• We extend the existing theory for the unweighted Lasso from linear regression
to more general convex loss functions.

• We develop a multistage method with recursive applications of an adaptive
Lasso and provide sharper risk bounds than those for the unweighted Lasso.

• We apply our general results to a number of important special cases, including
the linear, logistic and log-linear regression models.

This article is organized as follows. In Section 2 we describe a general formulation
of the absolute penalized minimization problem with a convex loss, along with
two basic inequalities and a number of examples. In Section 3 we develop oracle
inequalities for the weighted Lasso estimator for general quasi star-shaped loss
functions and an $\ell_2$ bound on the prediction error. In Section 4 we develop sharper
oracle inequalities for multistage recursive applications of an adaptive Lasso. In
Section 5 we derive sufficient conditions for selection consistency. In Section 6 we
provide an upper bound on the dimension of the Lasso estimator. Concluding remarks
are given in Section 7. All proofs are provided in an appendix.

2 Absolute penalized convex minimization



2.1 Definition and the KKT conditions 



We consider a general convex loss function of the form

$$\ell(\beta) = \psi(\beta) - \langle \beta, z \rangle, \tag{1}$$

where $\psi(\beta)$ is a known convex function, $z$ is observed and $\beta$ is unknown. Unless
otherwise stated, the inner product space is $\mathbb{R}^p$, so that $\{z, \beta\} \subset \mathbb{R}^p$ and $\langle \beta, z \rangle = \beta'z$.
Our analysis of (1) requires certain smoothness of the function $\psi(\beta)$ in terms of its
differentiability. In what follows, such smoothness assumptions are always explicitly
described by invoking the derivative of $\psi$. For any $v = (v_1, \ldots, v_p)'$, we use $\|v\|$ to
denote a general norm of $v$ and $|v|_q$ the $\ell_q$ norm $(\sum_j |v_j|^q)^{1/q}$, with $|v|_\infty = \max_j |v_j|$.
Let $\hat w \in \mathbb{R}^p$ be a (possibly estimated) weight vector with nonnegative elements $\hat w_j$,
$1 \le j \le p$, and $\hat W = \mathrm{diag}(\hat w)$. The weighted absolute penalized estimator, or weighted
Lasso, is defined as

$$\hat\beta = \arg\min_{\beta} \big\{ \ell(\beta) + \lambda |\hat W \beta|_1 \big\}. \tag{2}$$

Here we focus on the case where $\hat W$ is diagonal. In linear regression, [TT11]
considered non-diagonal, predetermined $\hat W$ and derived an algorithm for computing
the solution paths.

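A vector $\hat\beta$ is a global minimizer in (2) if and only if the negative gradient at $\hat\beta$
satisfies the Karush-Kuhn-Tucker (KKT) conditions,

$$g = -\dot\ell(\hat\beta) = z - \dot\psi(\hat\beta), \qquad \begin{cases} g_j = \hat w_j \lambda\, \mathrm{sgn}(\hat\beta_j) & \text{if } \hat\beta_j \neq 0, \\ g_j \in \hat w_j \lambda [-1, 1] & \text{for all } j, \end{cases} \tag{3}$$

where $\dot\ell(\beta) = (\partial/\partial\beta)\ell(\beta)$ and $\dot\psi(\beta) = (\partial/\partial\beta)\psi(\beta)$. Since the KKT conditions are
necessary and sufficient for (2), results on the performance of $\hat\beta$ can be viewed as
analytical consequences of (3).

The estimator (2) includes the $\ell_1$-penalized estimator, or the Lasso, with the
choice $\hat w_j = 1$, $1 \le j \le p$. A careful study of the (unweighted) Lasso in general
convex minimization (1) is by itself an interesting and important problem. Our work
includes the Lasso as a special case since $\hat w_j = 1$ is allowed in all our theorems.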
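
As an illustration of the KKT characterization (3), the sketch below solves a small weighted Lasso problem and measures how far the solution is from satisfying (3). It is not part of the original analysis: it assumes the linear regression loss with $\psi(\beta) = |X\beta|_2^2/(2n)$ and $z = X'y/n$, uses plain coordinate descent as one possible solver, and the function names, simulated data and weights are hypothetical choices made for the example.

```python
import random


def soft_threshold(u, t):
    """Soft-thresholding operator: sign(u) * max(|u| - t, 0)."""
    if u > t:
        return u - t
    if u < -t:
        return u + t
    return 0.0


def weighted_lasso_cd(X, y, w, lam, n_sweeps=500):
    """Coordinate descent for |y - X b|_2^2 / (2n) + lam * sum_j w[j] * |b[j]|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, updated in place
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # x_j'(residual with coordinate j removed) / n
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * w[j]) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def kkt_violation(X, y, w, lam, beta):
    """Largest violation of (3), with g = z - psi_dot(beta) = X'(y - X beta) / n."""
    n, p = len(X), len(X[0])
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    worst = 0.0
    for j in range(p):
        g = sum(X[i][j] * resid[i] for i in range(n)) / n
        if beta[j] != 0.0:
            sign = 1.0 if beta[j] > 0 else -1.0
            viol = abs(g - w[j] * lam * sign)
        else:
            viol = max(abs(g) - w[j] * lam, 0.0)
        worst = max(worst, viol)
    return worst


random.seed(0)
n, p = 40, 6
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
beta_true = [2.0, -1.5, 0.0, 0.0, 0.0, 0.0]
y = [sum(X[i][j] * beta_true[j] for j in range(p)) + 0.1 * random.gauss(0, 1)
     for i in range(n)]
w = [1.0, 1.0, 2.0, 2.0, 2.0, 0.0]  # w[5] = 0 leaves coordinate 5 unpenalized
beta_hat = weighted_lasso_cd(X, y, w, lam=0.1)
violation = kkt_violation(X, y, w, 0.1, beta_hat)
```

Because (3) is necessary and sufficient, a near-zero `violation` certifies the output of any solver, not only the coordinate descent used here.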




In practice, unequal $\hat w_j$ arise in many ways. In adaptive Lasso [Zou06], a
decreasing function of a certain initial estimator of $\beta_j$ is used as the weight $\hat w_j$ to
remove the bias of the Lasso. In [FL01, ZL08, Zha10b], the weights $\hat w_j$ are computed
iteratively with $\hat w_j = \dot\rho_\lambda(\hat\beta_j)$, where $\dot\rho_\lambda(t) = (d/dt)\rho_\lambda(t)$ with a suitable concave
penalty function $\rho_\lambda(t)$. This is also designed to remove the bias of the Lasso, since
the concavity of $\rho_\lambda(t)$ guarantees smaller weight for larger $\hat\beta_j$. In Section 4, we provide
results on the improvements of this weighted Lasso over the standard Lasso. In linear
regression, [Zha10b] gave suitable conditions under which this iterative algorithm
provides smaller weights $\hat w_j$ for most large $\beta_j$. Such nearly unbiased methods are
expected to produce better results than the Lasso when a significant fraction of
nonzero $|\beta_j|$ are of the order $\lambda$ or larger. Regardless of the computational methods,
the results in this paper demonstrate the benefits of using data dependent weights in
a general class of problems with convex losses.

Unequal weights may also arise for computational reasons. The Lasso with $\hat w_j = 1$
is expected to perform similarly to weighted Lasso with data dependent $1 \le \hat w_j \le C_0$,
with a fixed $C_0$. However, the weighted Lasso is easier to compute since $\hat w_j$ can be
determined as a part of an iterative algorithm. For example, in a gradient descent
algorithm, one may take larger steps and stop the computation as soon as the KKT
conditions (3) are attained for any weights satisfying $1 \le \hat w_j \le C_0$.

The weight function $\hat w_j$ can also be used to standardize the penalty level, for
example with $\hat w_j = \{\ddot\psi_{jj}(\hat\beta)\}^{1/2}$, where $\ddot\psi_{jj}(\beta)$ is the $j$-th diagonal element of the
Hessian matrix of $\psi(\beta)$. When $\psi(\beta)$ is quadratic, for example in linear regression,
$\hat w_j$ does not depend on $\hat\beta$. However, in other convex minimization problems, such
weights need to be computed iteratively.

Finally, in certain applications, the effects of a certain set $S_*$ of variables are
of primary interest, so that penalization of $\hat\beta_{S_*}$, and thus the resulting bias, should
be avoided. This leads to "semi-penalized" estimators with $\hat w_j = 0$ for $j \in S_*$, for
example, with $\hat w_j = I\{j \notin S_*\}$.
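
Two of the weight constructions described above can be sketched in a few lines. This is a hedged illustration only: the text does not fix a particular concave penalty, so the minimax concave penalty (MCP) with parameter `gamma` is used here as one example, the weights are normalized by $\lambda$ so that they lie in $[0, 1]$ (an assumed convention), and both function names are hypothetical.

```python
def concave_penalty_weights(beta_init, lam, gamma=3.0):
    """Weights rho_dot(|b_j|) / lam for the MCP, i.e. (1 - |b_j| / (gamma * lam))_+ .

    Larger initial estimates receive smaller weights, which reduces the bias
    that a uniform penalty would put on large coefficients.
    """
    return [max(1.0 - abs(b) / (gamma * lam), 0.0) for b in beta_init]


def semi_penalized_weights(p, unpenalized):
    """Weights w_j = I{j not in unpenalized}: no penalty on the variables of primary interest."""
    keep = set(unpenalized)
    return [0.0 if j in keep else 1.0 for j in range(p)]


beta_init = [2.0, 0.3, 0.0, -0.05]
w_concave = concave_penalty_weights(beta_init, lam=0.2)
w_semi = semi_penalized_weights(4, unpenalized=[0])
```

Either weight vector can be passed to any weighted Lasso solver; recomputing `w_concave` from the new solution and solving again gives the kind of recursive scheme studied in Section 4.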

2.2 Basic inequalities, prediction, and Bregman divergence 

Let $\beta^*$ denote a target vector for $\beta$. In high-dimensional models, the performance of
an estimator $\hat\beta$ is typically measured by its proximity to a target under conditions
on the sparsity of $\beta^*$ and the size of the negative gradient $-\dot\ell(\beta^*) = z - \dot\psi(\beta^*)$. For
$\ell_1$-penalized estimators, such results are often derived from the KKT conditions (3)
via certain basic inequalities, which are direct consequences of the KKT conditions
and have appeared in different forms in the literature, for example, in the papers cited
in the Introduction. Let $D(\beta, \beta^*) = \ell(\beta) - \ell(\beta^*) - \langle \dot\ell(\beta^*), \beta - \beta^* \rangle$ be the Bregman
divergence [Bre67] and consider its symmetrized version [NN07]

$$\Delta(\beta, \beta^*) = D(\beta, \beta^*) + D(\beta^*, \beta) = \langle \beta - \beta^*, \dot\psi(\beta) - \dot\psi(\beta^*) \rangle. \tag{4}$$

Since $\psi$ is convex, $\Delta(\beta, \beta^*) \ge 0$. Two basic inequalities below provide upper bounds
for the symmetrized Bregman divergence $\Delta(\hat\beta, \beta^*)$. The sparsity of $\beta^*$ is measured
by a weighted $\ell_1$ norm of $\beta^*$ in the first one and by the number of nonzero entries in the
second one.

Let $S$ be any set of indices satisfying $S \supseteq \{j : \beta^*_j \neq 0\}$ and let $S^c$ be the
complement of $S$ in $\{1, \ldots, p\}$. We shall refer to $S$ as the sparse set. Let $W = \mathrm{diag}(w)$
for a possibly unknown vector $w \in \mathbb{R}^p$ with elements $w_j \ge 0$. Define

$$z_0^* = |\{z - \dot\psi(\beta^*)\}_S|_\infty, \qquad z_1^* = |W_{S^c}^{-1}\{z - \dot\psi(\beta^*)\}_{S^c}|_\infty, \tag{5}$$

$$\Omega_0 = \{\hat w_j \le w_j\ \forall j \in S\} \cap \{w_j \le \hat w_j\ \forall j \in S^c\}, \tag{6}$$

where for any $p$-vector $v$ and set $A$, $v_A = (v_j : j \in A)'$. Here and in the sequel $M_{AB}$
denotes the $A \times B$ subblock of a matrix $M$ and $M_A = M_{AA}$.

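Lemma 1 (i) Let $\beta^*$ be a target vector. In the event $\Omega_0 \cap \{|(z - \dot\psi(\beta^*))_j| \le \hat w_j\lambda\ \forall j\}$,

$$\Delta(\hat\beta, \beta^*) \le 2\lambda|\hat W\beta^*|_1 \le 2\lambda|W\beta^*|_1. \tag{7}$$

(ii) For any target vector $\beta^*$ and $S \supseteq \{j : \beta^*_j \neq 0\}$, the error $h = \hat\beta - \beta^*$ satisfies

$$\Delta(\beta^* + h, \beta^*) + (\lambda - z_1^*)|W_{S^c}h_{S^c}|_1 \le \langle h_S, \{z - \dot\psi(\beta^*)\}_S - g_S \rangle \le (|w_S|_\infty\lambda + z_0^*)|h_S|_1 \tag{8}$$

in $\Omega_0$ for a certain negative gradient vector $g$ satisfying $|g_j| \le \hat w_j\lambda$. Consequently,
in $\Omega_0 \cap \{(|w_S|_\infty\lambda + z_0^*)/(\lambda - z_1^*) \le \xi\}$, $h \neq 0$ belongs to the sign-restricted cone
$\mathscr{C}_-(\xi, S) = \{b \in \mathscr{C}(\xi, S) : b_j(\dot\psi(\beta^* + b) - \dot\psi(\beta^*))_j \le 0\ \forall j \in S^c\}$, where

$$\mathscr{C}(\xi, S) = \{b \in \mathbb{R}^p : |W_{S^c}b_{S^c}|_1 \le \xi|b_S|_1 \neq 0\}. \tag{9}$$

Remark 2.1 Sufficient conditions are given in Subsection 3.2 for $\{|(z - \dot\psi(\beta^*))_j| \le \hat w_j\lambda\ \forall j\}$
to hold with high probability in generalized linear models. See Lemma 2,
Remarks 3.3 and 3.4 and Examples 3.2, 3.3 and 3.4.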



A useful feature of Lemma 1 is the explicit statements of the monotonicity of the
basic inequality in the weights. By Lemma 1 (ii), it suffices to study the analytical
properties of the penalized criterion with the error $h = \hat\beta - \beta^*$ in the sign-restricted
cone, provided that the event $(|w_S|_\infty\lambda + z_0^*)/(\lambda - z_1^*) \le \xi$ has large probability.
However, unless $\mathscr{C}_-(\xi, S)$ is specified, we will consider the larger cone in (9) in order
to simplify the analysis. The choices of the target vector $\beta^*$, the sparse set $S \supseteq \{j : \beta^*_j \neq 0\}$,
weight vector $\hat w$ and its bound $w$ are quite flexible. The main requirement
is that $\{|S|, z_0^*, z_1^*\}$ should be small. In linear regression or generalized linear models,
we may conveniently consider $\beta^*$ as the vector of true regression coefficients under a
probability measure $P_{\beta^*}$. However, $\beta^*$ can also be a sparse version of a true $\beta$, e.g.
$\beta^*_j = \beta_j I\{|\beta_j| \ge \tau\}$ for a threshold value $\tau$ under $P_\beta$.

The upper bound in Lemma 1 (i) gives the so called "slow rate" of convergence
for the Bregman divergence. In Section 3, we provide "fast rate" of convergence for
the Bregman divergence via oracle inequalities for $|h_S|_1$ in (8). The symmetrized
Bregman divergence $\Delta(\hat\beta, \beta^*)$ has the interpretations as the regret in prediction error
in linear regression, the symmetrized Kullback-Leibler (KL) divergence in generalized
linear models (GLM) and density estimation, and a spectrum loss for the graphical
Lasso, as shown in examples below.

Example 2.1 (Linear regression) Consider the linear regression model

$$y_i = \sum_{j=1}^p x_{ij}\beta_j + \varepsilon_i, \quad i = 1, \ldots, n, \tag{10}$$

where $y_i$ is the response variable, $x_{ij}$ are predictors or design variables, and $\varepsilon_i$ is the
error term. Let $y = (y_1, \ldots, y_n)'$ and let $X$ be the design matrix whose $i$th row is
$x^i = (x_{i1}, \ldots, x_{ip})$. The estimator (2) is a weighted Lasso with $\psi(\beta) = |X\beta|_2^2/(2n)$
and $z = X'y/n$ in (1). For predicting a vector $\tilde y$ with $E_{\beta^*}[\tilde y | X, y] = X\beta^*$,

$$n\Delta(\hat\beta, \beta^*) = |X\hat\beta - X\beta^*|_2^2 = E_{\beta^*}\big[|\tilde y - X\hat\beta|_2^2 \,\big|\, X, y\big] - \min_{\delta(X, y)} E_{\beta^*}\big[|\tilde y - \delta(X, y)|_2^2 \,\big|\, X, y\big]$$

is the regret of using the linear predictor $X\hat\beta$ compared with the optimal predictor. See
[GR04] for several implications of (7).
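
The identities behind Example 2.1 can be checked numerically. The sketch below is illustrative only: it assumes the quadratic $\psi(\beta) = |X\beta|_2^2/(2n)$ with $z = X'y/n$, uses simulated data, and all helper names are hypothetical. It computes the two Bregman divergences directly from their definition and compares their sum with the inner-product form in (4) and with $|X\beta - X\beta^*|_2^2/n$.

```python
import random


def psi(X, beta):
    """psi(beta) = |X beta|_2^2 / (2n)."""
    n = len(X)
    return sum(sum(xij * bj for xij, bj in zip(row, beta)) ** 2 for row in X) / (2 * n)


def psi_dot(X, beta):
    """Gradient of psi: X'X beta / n."""
    n, p = len(X), len(beta)
    fitted = [sum(xij * bj for xij, bj in zip(row, beta)) for row in X]
    return [sum(X[i][j] * fitted[i] for i in range(n)) / n for j in range(p)]


def loss(X, z, beta):
    """l(beta) = psi(beta) - <beta, z>, as in (1)."""
    return psi(X, beta) - sum(bj * zj for bj, zj in zip(beta, z))


def bregman(X, z, b, b_star):
    """D(b, b*) = l(b) - l(b*) - <l_dot(b*), b - b*>."""
    grad_star = [g - zj for g, zj in zip(psi_dot(X, b_star), z)]
    lin = sum(g * (bj - sj) for g, bj, sj in zip(grad_star, b, b_star))
    return loss(X, z, b) - loss(X, z, b_star) - lin


random.seed(1)
n, p = 30, 4
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
z = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(p)]
b, b_star = [1.0, -0.5, 0.2, 0.0], [0.8, 0.0, 0.0, 0.0]

# Three expressions that should agree: D + D, the inner product in (4),
# and the squared prediction distance divided by n.
sym = bregman(X, z, b, b_star) + bregman(X, z, b_star, b)
inner = sum((bj - sj) * (g - gs)
            for bj, sj, g, gs in zip(b, b_star, psi_dot(X, b), psi_dot(X, b_star)))
pred = sum(sum(row[j] * (b[j] - b_star[j]) for j in range(p)) ** 2 for row in X) / n
```

Note that $z$ cancels in the symmetrized divergence, so the result does not depend on the simulated response.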

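Example 2.2 (Logistic regression) We observe $(X, y) \in \mathbb{R}^{n \times (p+1)}$ with
independent rows $(x^i, y_i)$, where $y_i \in \{0, 1\}$ are binary response variables with

$$P_\beta(y_i = 1 | x^i) = \pi_i(\beta) = \exp(x^i\beta)/(1 + \exp(x^i\beta)), \quad 1 \le i \le n. \tag{11}$$

The loss function (1) is the average negative log-likelihood

$$\ell(\beta) = \psi(\beta) - z'\beta \quad \text{with} \quad \psi(\beta) = \sum_{i=1}^n \frac{\log(1 + \exp(x^i\beta))}{n}, \quad z = X'y/n. \tag{12}$$

Thus, (2) is a weighted $\ell_1$ penalized MLE. For probabilities $\{\pi', \pi''\} \subset (0, 1)$, the
KL information is $K(\pi', \pi'') = \pi'\log(\pi'/\pi'') + (1 - \pi')\log\{(1 - \pi')/(1 - \pi'')\}$. Since
$\dot\psi(\beta) = \sum_{i=1}^n (x^i)'\pi_i(\beta)/n$ and $\mathrm{logit}(\pi_i(\beta^*)) - \mathrm{logit}(\pi_i(\beta)) = x^i(\beta^* - \beta)$, (4) gives

$$\Delta(\beta^*, \beta) = \frac{1}{n}\sum_{i=1}^n \big\{ K(\pi_i(\beta^*), \pi_i(\beta)) + K(\pi_i(\beta), \pi_i(\beta^*)) \big\}.$$

Thus, $\Delta(\beta^*, \beta)$ is the symmetrized KL-divergence.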
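
The claim of Example 2.2 can be verified numerically on simulated covariates. The sketch below is a minimal check under the logistic model (11), not part of the original text; the function names and the data are hypothetical. It evaluates the inner-product form (4) with $\dot\psi(\beta) = \sum_i (x^i)'\pi_i(\beta)/n$ and compares it with the average symmetrized KL information.

```python
import math
import random


def prob(x_row, beta):
    """pi_i(beta) = exp(x beta) / (1 + exp(x beta)) from (11)."""
    t = sum(xj * bj for xj, bj in zip(x_row, beta))
    return 1.0 / (1.0 + math.exp(-t))


def kl(p1, p2):
    """KL information K(p1, p2) between two Bernoulli probabilities."""
    return p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2))


def sym_bregman_logistic(X, b, b_star):
    """<b* - b, psi_dot(b*) - psi_dot(b)>, computed row by row."""
    total = 0.0
    for row in X:
        dtheta = sum(xj * (sj - bj) for xj, sj, bj in zip(row, b_star, b))
        total += dtheta * (prob(row, b_star) - prob(row, b))
    return total / len(X)


def avg_sym_kl(X, b, b_star):
    """Average over rows of K(pi*, pi) + K(pi, pi*)."""
    return sum(kl(prob(r, b_star), prob(r, b)) + kl(prob(r, b), prob(r, b_star))
               for r in X) / len(X)


random.seed(2)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(25)]
b, b_star = [0.5, -1.0, 0.0], [1.0, 0.0, 0.3]
lhs = sym_bregman_logistic(X, b, b_star)
rhs = avg_sym_kl(X, b, b_star)
```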

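Example 2.3 (GLM) The GLM contains the linear and logistic regression models
as special cases. We observe $(X, y) \in \mathbb{R}^{n \times (p+1)}$ with rows $(x^i, y_i)$. Suppose that
conditionally on $X$, $y_i$ are independent under $P_\beta$ with

$$y_i \sim f(y_i | \theta_i) = \exp\Big( \frac{\theta_i y_i - \psi_0(\theta_i)}{\sigma^2} + \frac{c(y_i, \sigma)}{\sigma^2} \Big), \quad \theta_i = x^i\beta. \tag{13}$$

Let $f_{(n)}(y | X, \beta) = \prod_{i=1}^n f(y_i | x^i\beta)$. The loss function can be written as a normalized
negative likelihood $\ell(\beta) = -(\sigma^2/n)\log f_{(n)}(y | X, \beta)$ with $z = X'y/n$ and
$\psi(\beta) = \sum_{i=1}^n \{\psi_0(x^i\beta) - c(y_i, \sigma)\}/n$. The KL divergence is

$$D\big(f_{(n)}(\cdot | X, \beta^*) \,\big\|\, f_{(n)}(\cdot | X, \beta)\big) = E_{\beta^*} \log\Big( \frac{f_{(n)}(y | X, \beta^*)}{f_{(n)}(y | X, \beta)} \Big).$$

The symmetrized Bregman divergence can be written as

$$\Delta(\hat\beta, \beta^*) = \frac{\sigma^2}{n}\Big\{ D\big(f_{(n)}(\cdot | X, \beta^*) \,\big\|\, f_{(n)}(\cdot | X, \hat\beta)\big) + D\big(f_{(n)}(\cdot | X, \hat\beta) \,\big\|\, f_{(n)}(\cdot | X, \beta^*)\big) \Big\}. \tag{14}$$

Example 2.4 (Nonparametric density estimation) Although the focus of this
paper is on regression models, here we illustrate that $\Delta(\hat\beta, \beta^*)$ is the symmetrized
KL divergence in the context of nonparametric density estimation. Suppose the
observations $y = (y_1, \ldots, y_n)'$ are iid from $f(\cdot | \beta) = \exp\{\langle \beta, T(\cdot) \rangle - \psi(\beta)\}$ under
$P_\beta$, where $T(\cdot) = (u_j(\cdot), j \le p)'$ with certain basis functions $u_j(\cdot)$. Let the loss
function $\ell(\beta)$ in (1) be the average negative log-likelihood $-n^{-1}\sum_{i=1}^n \log f(y_i | \beta)$ with
$z = n^{-1}\sum_{i=1}^n T(y_i)$. Since $E_\beta T(y_i) = \dot\psi(\beta)$, the KL divergence is

$$D\big(f(\cdot | \beta^*) \,\big\|\, f(\cdot | \beta)\big) = E_{\beta^*} \log\Big( \frac{f(y_i | \beta^*)}{f(y_i | \beta)} \Big) = \psi(\beta) - \psi(\beta^*) - \langle \beta - \beta^*, \dot\psi(\beta^*) \rangle.$$

Again, the symmetrized KL divergence between the target density $f(\cdot | \beta^*)$ and the
estimated density $f(\cdot | \hat\beta)$ is

$$\Delta(\hat\beta, \beta^*) = D\big(f(\cdot | \beta^*) \,\big\|\, f(\cdot | \hat\beta)\big) + D\big(f(\cdot | \hat\beta) \,\big\|\, f(\cdot | \beta^*)\big). \tag{15}$$

[vdG08] pointed out that for this example, the natural choices of the basis functions
$u_j$ and weights $w_j$ satisfy $\int u_j\,d\nu = 0$ and $w_j^2 = \int u_j^2\,d\nu$.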

Example 2.5 (Graphical Lasso) Suppose we observe $X \in \mathbb{R}^{n \times p}$ and would like to
estimate the precision matrix $\beta = (EX'X/n)^{-1} \in \mathbb{R}^{p \times p}$. In the graphical Lasso, (1)
is the length normalized negative likelihood with $\psi(\beta) = -\log\det\beta$, $z = -X'X/n$,
and $\langle \beta, z \rangle = \mathrm{trace}(\beta z)$. Since $\dot\psi(\beta) = E_\beta z = -\beta^{-1}$, we find

$$\Delta(\hat\beta, \beta^*) = \mathrm{trace}\big((\hat\beta - \beta^*)((\beta^*)^{-1} - \hat\beta^{-1})\big) = \sum_{j=1}^p (\lambda_j - 1)^2/\lambda_j, \tag{16}$$

where $(\lambda_1, \ldots, \lambda_p)$ are the eigenvalues of $(\beta^*)^{-1/2}\hat\beta(\beta^*)^{-1/2}$. In graphical Lasso, the
diagonal elements are typically not penalized. Consider $\hat w_{jk} = I\{j \neq k\}$, so that the
penalty for the off-diagonal elements is uniformly weighted. Since Lemma 1 requires
$|(z - \dot\psi(\beta^*))_{jk}| \le \hat w_{jk}\lambda$, $\beta^*$ is taken to match $X'X/n$ on the diagonal and the true
$\beta^o$ in correlations. Let $S = \{(j, k) : \beta^o_{jk} \neq 0, j \neq k\}$. In the event
$\max_{j \neq k}|(z - \dot\psi(\beta^*))_{jk}| \le \lambda$, Lemma 1 (i) gives $\|(\beta^*)^{-1/2}\hat\beta(\beta^*)^{-1/2} - I_{p \times p}\|_2 = o(1)$ under the condition
$|S|\lambda\max_{j \neq k}|\beta^*_{jk}| = o(1)$, where $\|\cdot\|_2$ is the spectrum norm. [RBLZ08] proved the
consistency of the graphical Lasso under similar conditions with a different analysis.

3 Oracle inequalities



In this section, we extract upper bounds for the estimation error $\hat\beta - \beta^*$ from the
basic inequality (8). Since (8) is monotone in the weights, the oracle inequalities are
sharper when the weights $\hat w_j$ are smaller in $S = \{j : \beta^*_j \neq 0\}$ and larger in $S^c$.

We say that a function $\phi(b)$ defined in $\mathbb{R}^p$ is quasi star-shaped if $\phi(tb)$ is continuous
and non-decreasing in $t \in [0, \infty)$ for all $b \in \mathbb{R}^p$ and $\lim_{b \to 0}\phi(b) = 0$. All seminorms
are quasi star-shaped. The sublevel sets $\{b : \phi(b) \le t\}$ of a quasi star-shaped function
are all star-shaped. For $0 \le \eta^* \le 1$ and any pair of quasi star-shaped functions $\phi_0(b)$
and $\phi(b)$, define

$$F(\xi, S; \phi_0, \phi) = \inf\Big\{ \frac{\Delta(\beta^* + b, \beta^*)e^{\phi_0(b)}}{|b_S|_1\phi(b)} : b \in \mathscr{C}(\xi, S),\ \phi_0(b) \le \eta^* \Big\}, \tag{17}$$

where $\Delta(\beta, \beta^*)$ is as in (4). We refer to $F(\xi, S; \phi_0, \phi)$ as a general invertibility factor
(GIF) over the cone (9). The GIF plays a crucial role in developing the error bounds
for $\hat\beta - \beta^*$. It extends the squared compatibility constant [vdGB09] and the weak
and sign-restricted cone invertibility factors [YZ10] from the linear regression model
with $\phi_0(\cdot) = 0$ to more general model (1) and from $\ell_q$ norms to general $\phi(\cdot)$. They
are all closely related to the restricted eigenvalues [BRT09, Kol09] as we will discuss
in Subsection 3.1.

The basic inequality (8) implies that the symmetrized Bregman divergence
$\Delta(\hat\beta, \beta^*)$ is no greater than a linear function of $|h_S|_1$, where $h = \hat\beta - \beta^*$. If $\Delta(\hat\beta, \beta^*)$
is no smaller than a linear function of the product $|h_S|_1\phi(h)$, then an upper bound
for $\phi(h)$ exists. Since the symmetrized Bregman divergence (4) is approximately
quadratic, $\Delta(\hat\beta, \beta^*) \approx h'\ddot\psi(\beta^*)h$, in a neighborhood of $\beta^*$, this is reasonable when
$h = \hat\beta - \beta^*$ is not too large and $\ddot\psi(\beta^*)$ is invertible in the cone. A suitable factor $e^{\phi_0(b)}$
in (17) forces the computation of this lower bound in a proper neighborhood of $\beta^*$.

We first provide a set of general oracle inequalities. 

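Theorem 1 Let $\{z_0^*, z_1^*\}$ be as in (5) with $S \supseteq \{j : \beta^*_j \neq 0\}$, $\Omega_0$ in (6), $0 \le \eta \le \eta^* \le 1$,
and $\{\phi_0(b), \phi(b)\}$ be a pair of quasi star-shaped functions. Let $\phi_{1,S}(b) = |b_S|_1/|S|$.
In the event

$$\Omega_0 \cap \Big\{ \frac{|w_S|_\infty\lambda + z_0^*}{\lambda - z_1^*} \le \xi,\ \frac{|w_S|_\infty\lambda + z_0^*}{F(\xi, S; \phi_0, \phi_0)} \le \eta e^{-\eta} \Big\}, \tag{18}$$

the following oracle inequalities hold:

$$\phi_0(\hat\beta - \beta^*) \le \eta, \qquad \phi(\hat\beta - \beta^*) \le \frac{e^{\eta}(|w_S|_\infty\lambda + z_0^*)}{F(\xi, S; \phi_0, \phi)}, \tag{19}$$

$$\Delta(\hat\beta, \beta^*) + (\lambda - z_1^*)|W_{S^c}h_{S^c}|_1 \le \frac{e^{\eta}(|w_S|_\infty\lambda + z_0^*)^2|S|}{F(\xi, S; \phi_0, \phi_{1,S})}. \tag{20}$$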



Remark 3.1 Sufficient conditions are given in Subsection 3.2 for (18) to hold with
high probability. See Lemma 2, Remarks 3.3 and 3.4 and Examples 3.2, 3.3 and 3.4.



The oracle inequalities in Theorem 1 control both the estimation error in terms of
$\phi_0(\hat\beta - \beta^*)$ and the prediction error in terms of the symmetrized Bregman divergence
$\Delta(\hat\beta, \beta^*)$ discussed in Section 2. Since they are based on (17) in the intersection of
the cone and the unit ball $\{b : \phi_0(b) \le 1/e\}$, they are different from typical results in
a small-ball analysis based on the Taylor expansion of $\psi(\beta)$ at $\beta = \beta^*$. Theorem 1
does allow $\phi_0(\cdot) = 0$ with $F(\xi, S; \phi_0, \phi_0) = \infty$ and $\eta = 0$ in linear regression.

3.1 The Hessian and related quantities 

We describe the relationship between the GIF (17) and the Hessian of the convex
function $\psi(\cdot)$ in (1) and examine cases where the quasi star-shaped functions $\phi_0(\cdot)$ and
$\phi(\cdot)$ are familiar seminorms. Throughout, we assume that $\psi(\beta)$ is twice differentiable.
Let $\ddot\psi(\beta)$ be the Hessian of $\psi(\beta)$ and $\Sigma^* = \ddot\psi(\beta^*)$.

The GIF (17) can be simplified if for a certain nonnegative-definite matrix $\Sigma$,

$$\Delta(\beta^* + b, \beta^*)e^{\phi_0(b)} \ge \langle b, \Sigma b \rangle, \quad \forall\, b \in \mathscr{C}(\xi, S),\ \phi_0(b) \le \eta^*. \tag{21}$$

Since $\Delta(\beta^* + h, \beta^*) = \int_0^1 \langle h, \ddot\psi(\beta^* + th)h \rangle\,dt$ by (4), (21) is a smoothness condition on
the Hessian when $\Sigma = \Sigma^*$. In what follows, $\Sigma = \Sigma^*$ is allowed in all statements unless
otherwise stated. Under (21), (17) is bounded from below by the simple GIF,

$$F_0(\xi, S; \phi) = \inf_{b \in \mathscr{C}(\xi, S)} \frac{b'\Sigma b}{|b_S|_1\phi(b)}. \tag{22}$$

In linear regression, $F_0(\xi, S; \phi)$ is the square of the compatibility factor for $\phi(b) = \phi_{1,S}(b) = |b_S|_1/|S|$
[vdG07] and the cone invertibility factor for $\phi(b) = \phi_q(b) = |b|_q/|S|^{1/q}$ [YZ10].
They are both closely related to the restricted isometry property
(RIP) [CT05], the sparse Riesz condition (SRC) [ZH08], and the restricted eigenvalue
[BRT09]. Extensive discussion of these quantities can be found in [BRT09, vdGB09,
YZ10]. The following corollary is an extension of an oracle inequality of [YZ10] for
the linear regression model.

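Corollary 1 Let $\eta \le \eta^* \le 1$. Suppose (21) holds. Then, in the event

$$\Omega_0 \cap \big\{ |w_S|_\infty\lambda + z_0^* \le \min\big(\xi(\lambda - z_1^*),\ \eta e^{-\eta}F_0(\xi, S; \phi_0)\big) \big\},$$

(19) and (20) hold with $F(\xi, S; \phi_0, \phi)$ replaced by the simpler $F_0(\xi, S; \phi)$ in (22). In
particular, in the same event,

$$\phi_0(h) \le \eta, \qquad |h|_q \le \frac{e^{\eta}(|w_S|_\infty\lambda + z_0^*)|S|^{1/q}}{F_0(\xi, S; \phi_q)}, \quad \forall\, q > 0, \tag{23}$$

with $\phi_q(b) = |b|_q/|S|^{1/q}$ and $h = \hat\beta - \beta^*$, and with $\phi_{1,S}(b) = |b_S|_1/|S|$,

$$e^{-\eta}h'\Sigma h \le \Delta(\hat\beta, \beta^*) \le \frac{e^{\eta}(|w_S|_\infty\lambda + z_0^*)^2|S|}{F_0(\xi, S; \phi_{1,S})} - (\lambda - z_1^*)|W_{S^c}h_{S^c}|_1. \tag{24}$$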

Here the only differences between the general model (1) and linear regression
($\phi_0(b) = 0$) are the extra factor $e^{\eta}$ with $\eta \le 1$, the extra constraint
$|w_S|_\infty\lambda + z_0^* \le \eta e^{-\eta}F_0(\xi, S; \phi_0)$, and the extra condition (21). Moreover, (22) explicitly expresses all
conditions on $F_0(\xi, S; \phi)$ as properties of a fixed $\Sigma$.

Example 3.1 (Linear regression: oracle inequalities) For $\psi(\beta) = |X\beta|_2^2/(2n)$
and $\Sigma = X'X/n$, $F_0(\xi, S; \phi_q)$ is the weak cone invertibility factor [YZ10]
and $F_0^{1/2}(\xi, S; \phi_{1,S})$ is the compatibility constant [vdG07]

$$\kappa_*(\xi, S) = \inf_{b \in \mathscr{C}(\xi, S)} \frac{|S|^{1/2}|Xb|_2}{|b_S|_1 n^{1/2}} = \inf_{b \in \mathscr{C}(\xi, S)} \Big( \frac{b'\Sigma b}{|b_S|_1^2/|S|} \Big)^{1/2}. \tag{25}$$

They are all closely related to the $\ell_2$ restricted eigenvalues

$$RE_2(\xi, S) = \inf_{b \in \mathscr{C}(\xi, S)} \frac{|Xb|_2}{|b|_2 n^{1/2}} = \inf_{b \in \mathscr{C}(\xi, S)} \Big( \frac{b'\Sigma b}{|b|_2^2} \Big)^{1/2} \tag{26}$$

[BRT09, Kol09]. Since $|b_S|_1^2 \le |b|_2^2|S|$, $\kappa_*(\xi, S) \ge RE_2(\xi, S)$ [vdGB09]. For the Lasso
with $\hat w_j = 1$,

$$|\hat\beta - \beta^*|_2 \le \frac{|S|^{1/2}(\lambda + z_0^*)}{SCIF_2(\xi, S)} \le \frac{|S|^{1/2}(\lambda + z_0^*)}{F_0(\xi, S; \phi_2)} \le \frac{|S|^{1/2}(\lambda + z_0^*)}{\kappa_*(\xi, S)RE_2(\xi, S)} \tag{27}$$

in the event $\lambda + z_0^* \le \xi(\lambda - z_1^*)$ [YZ10], where

$$SCIF_q(\xi, S) = \inf_{b \in \mathscr{C}_-(\xi, S)} |\Sigma b|_\infty/\phi_q(b), \qquad \phi_q(b) = |b|_q/|S|^{1/q}.$$

Thus, cone and general invertibility factors yield sharper $\ell_2$ oracle inequalities.

The factors in the oracle inequalities in (27) do not have the same order for
large $|S|$ and certain design matrices $X$. Although the oracle inequality based on
$SCIF_2(\xi, S)$ is the sharpest in (27), it seems not to lead to a simple extension to the
general convex minimization with (1). Thus, we settle for extensions of the second
sharpest oracle inequality in (27) with $F_0(\xi, S; \cdot)$.



3.2 Oracle inequalities for the Lasso in GLM 

An important special case of the general formulation is the $\ell_1$-penalized estimator in
a generalized linear model (GLM) [MN89]. This is Example 2.3 in Subsection 2.2,
where we set up the notation in (13) and gave the KL divergence interpretation to
(4). The $\ell_1$ penalized, normalized negative likelihood is

$$\ell(\beta) = \psi(\beta) - z'\beta, \quad \text{with} \quad \psi(\beta) = \sum_{i=1}^n \frac{\psi_0(x^i\beta) - c(y_i, \sigma)}{n} \quad \text{and} \quad z = \frac{X'y}{n}. \tag{28}$$

Assume that $\psi_0$ is twice differentiable. Denote the first and second derivatives of $\psi_0$
by $\dot\psi_0$ and $\ddot\psi_0$, respectively. The gradient and Hessian are

$$\dot\psi(\beta) = X'\dot\psi_0(\theta)/n \quad \text{and} \quad \ddot\psi(\beta) = X'\mathrm{diag}(\ddot\psi_0(\theta))X/n, \tag{29}$$

where $\theta = X\beta$ and $\dot\psi_0$ and $\ddot\psi_0$ are applied to the individual components of $\theta$.
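
The formulas in (29) translate directly into code. The following sketch is illustrative rather than part of the original development: the function names and the tiny design matrix are hypothetical, and the Poisson case $\psi_0(t) = e^t$ is used only because both derivatives are then `math.exp`. A finite-difference comparison of the gradient against the first Hessian column serves as a sanity check.

```python
import math


def glm_gradient(X, beta, psi0_dot):
    """psi_dot(beta) = X' psi0_dot(theta) / n with theta = X beta."""
    n, p = len(X), len(beta)
    theta = [sum(xj * bj for xj, bj in zip(row, beta)) for row in X]
    return [sum(X[i][j] * psi0_dot(theta[i]) for i in range(n)) / n for j in range(p)]


def glm_hessian(X, beta, psi0_ddot):
    """psi_ddot(beta) = X' diag(psi0_ddot(theta)) X / n."""
    n, p = len(X), len(beta)
    theta = [sum(xj * bj for xj, bj in zip(row, beta)) for row in X]
    return [[sum(X[i][j] * psi0_ddot(theta[i]) * X[i][k] for i in range(n)) / n
             for k in range(p)] for j in range(p)]


# Poisson (log-linear) case: psi0(t) = exp(t), so both derivatives equal exp.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
beta = [0.1, -0.3]
grad = glm_gradient(X, beta, math.exp)
hess = glm_hessian(X, beta, math.exp)

# Forward difference of the gradient in the first coordinate approximates
# the first column of the Hessian.
eps = 1e-6
shifted = glm_gradient(X, [beta[0] + eps, beta[1]], math.exp)
fd_col0 = [(s - g) / eps for s, g in zip(shifted, grad)]
```

Passing a different pair of derivative functions, such as the logistic ones, gives the corresponding quantities for another GLM without changing the two helpers.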
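A crucial condition in our analysis of the Lasso in GLM is

$$\max_{i \le n} \big| \log\ddot\psi_0(x^i\beta^* + t) - \log\ddot\psi_0(x^i\beta^*) \big| \le M_1|t|, \quad \forall\, M_1|t| \le \eta^*, \tag{30}$$

where $M_1$ and $\eta^*$ are constants determined by $\psi_0$. This condition gives

$$\Delta(\beta^* + b, \beta^*) = \int_0^1 \langle b, \ddot\psi(\beta^* + tb)b \rangle\,dt \ge \sum_{i=1}^n \frac{\ddot\psi_0(x^i\beta^*)(x^ib)^2}{n}\int_0^1 e^{-tM_1|x^ib|}I\{tM_1|x^ib| \le \eta^*\}\,dt,$$

which implies the following lower bound for the GIF in (17):

$$F(\xi, S; \phi_0, \phi) \ge \inf_{b \in \mathscr{C}(\xi, S),\ \phi_0(b) \le \eta^*} \sum_{i=1}^n \frac{\ddot\psi_0(x^i\beta^*)(x^ib)^2}{n|b_S|_1\phi(b)}\int_0^1 I\{tM_1|x^ib| \le \phi_0(b)\}\,dt$$

$$= \inf_{b \in \mathscr{C}(\xi, S),\ \phi_0(b) \le \eta^*} \sum_{i=1}^n \frac{\ddot\psi_0(x^i\beta^*)}{n|b_S|_1\phi(b)}\min\Big( \frac{|x^ib|\phi_0(b)}{M_1},\ (x^ib)^2 \Big) \tag{31}$$

due to $(x^ib)^2\int_0^1 I\{tM_1|x^ib| \le \phi_0(b)\}\,dt = \min\{|x^ib|\phi_0(b)/M_1, (x^ib)^2\}$. For seminorms
$\phi_0$ and $\phi$, the infimum above can be taken over a fixed value of $\phi_0(b)$ due to scale
invariance. Thus, for $\phi_0(b) = M_2|b|_2$ and seminorms $\phi$, the lower bound in (31) is

$$F_*(\xi, S; \phi) = \inf_{b \in \mathscr{C}(\xi, S),\ |b|_2 = 1} \sum_{i=1}^n \frac{\ddot\psi_0(x^i\beta^*)}{n|b_S|_1\phi(b)}\min\Big( \frac{M_2|x^ib|}{M_1},\ (x^ib)^2 \Big). \tag{32}$$

If (30) holds with $\eta^* = \infty$, the convexity of $e^{-t}$ yields (21) with

$$\phi_0(b) = \frac{M_1\sum_{i=1}^n \ddot\psi_0(x^i\beta^*)|x^ib|^3}{\sum_{i=1}^n \ddot\psi_0(x^i\beta^*)(x^ib)^2} \tag{33}$$

with an application of the Jensen inequality. This gives a special $F_0(\xi, S; \phi_0)$ as

$$F_*(\xi, S) = \inf_{b \in \mathscr{C}(\xi, S)} \frac{\big\{\sum_{i=1}^n \ddot\psi_0(x^i\beta^*)(x^ib)^2\big\}^2}{M_1 n|b_S|_1\sum_{i=1}^n \ddot\psi_0(x^i\beta^*)|x^ib|^3}. \tag{34}$$

We note that since $|Xb|_\infty \le |X_S|_\infty|b_S|_1 + |X_{S^c}W_{S^c}^{-1}|_\infty|W_{S^c}b_{S^c}|_1 \le \{|X_S|_\infty + \xi|X_{S^c}W_{S^c}^{-1}|_\infty\}|b_S|_1$
in the cone $\mathscr{C}(\xi, S)$ in (9), for $\phi_0(b) = M_3|b_S|_1$ with
$M_3 = M_1\{|X_S|_\infty + \xi|X_{S^c}W_{S^c}^{-1}|_\infty\}$, (30) automatically implies the stronger

$$e^{-\phi_0(b)}\langle b, \Sigma^*b \rangle \le \Delta(\beta^* + b, \beta^*) \le e^{\phi_0(b)}\langle b, \Sigma^*b \rangle, \quad \forall\, b \in \mathscr{C}(\xi, S),\ \phi_0(b) \le \eta^*. \tag{35}$$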

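Under condition (30), we may also use the following large deviation inequalities
to find explicit penalty levels to guarantee (18).

Lemma 2 (i) Suppose (13) and (30) hold with certain $\{M_1, \eta^*\}$ and the $w_j$ in (6)
are deterministic. Let $x_j$ be the columns of $X$, $\Sigma^*_{jj}$ be the diagonal elements of $\Sigma^* = \ddot\psi(\beta^*)$.
For positive constants $\{\lambda_0, \lambda_1\}$ define $t_j = \lambda_0 I\{j \in S\} + w_j\lambda_1 I\{j \notin S\}$. Suppose

$$M_1\max_{j \le p}\big(|x_j|_\infty t_j/\Sigma^*_{jj}\big) \le \eta_0e^{\eta_0} \quad \text{and} \quad \sum_{j=1}^p \exp\Big\{ -\frac{nt_j^2}{2\sigma^2\Sigma^*_{jj}e^{\eta_0}} \Big\} \le \frac{\epsilon_0}{2} \tag{36}$$

for certain constants $\eta_0 \le \eta^*$ and $\epsilon_0 > 0$. Then, $P_{\beta^*}\{z_0^* \le \lambda_0, z_1^* \le \lambda_1\} \ge 1 - \epsilon_0$.
(ii) If $c_0 = \max_t \ddot\psi_0(t)$, then part (i) is still valid if (30) and (36) are replaced by

$$\sum_{j=1}^p \exp\Big\{ -\frac{n^2t_j^2}{2\sigma^2c_0|x_j|_2^2} \Big\} \le \frac{\epsilon_0}{2}. \tag{37}$$

In particular, if $|x_j|_2^2 = n$, $1 \le j \le p$, $w_j = 1$, $j \notin S$, and $\lambda_0 = \lambda_1 = \lambda$ (so $t_j = \lambda$), then
part (i) still holds if $\lambda \ge \sigma\sqrt{(2c_0/n)\log(2p/\epsilon_0)}$.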
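
The last statement of Lemma 2 gives an explicit penalty level, which the sketch below evaluates. It is a hedged illustration, not the authors' procedure: it assumes standardized columns $|x_j|_2^2 = n$ and $t_j = \lambda$ for all $j$, so that the left side of (37) reduces to $p\exp\{-n\lambda^2/(2\sigma^2c_0)\}$; the function names and the default values ($c_0 = 1/4$, the logistic case, and $\epsilon_0 = 0.05$) are choices made for the example.

```python
import math


def penalty_level(n, p, sigma=1.0, c0=0.25, eps0=0.05):
    """lambda = sigma * sqrt((2 * c0 / n) * log(2p / eps0)), as in Lemma 2 (ii)."""
    return sigma * math.sqrt((2.0 * c0 / n) * math.log(2.0 * p / eps0))


def tail_sum(n, p, lam, sigma=1.0, c0=0.25):
    """Left side of (37) when |x_j|_2^2 = n and t_j = lam for every j."""
    return p * math.exp(-n * lam ** 2 / (2.0 * sigma ** 2 * c0))


lam = penalty_level(n=200, p=1000)
bound = tail_sum(200, 1000, lam)  # equals eps0 / 2 at the boundary choice of lambda
```

The penalty level grows only like $\sqrt{\log p}$, which is what allows $p$ to be much larger than $n$.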

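The following theorem is a consequence of Theorem 1, Corollary 1 and Lemma 2.

Theorem 2 (i) Let $\hat\beta$ be the Lasso (2) with the loss function in (28). Let $\beta^*$ be a
target vector and $h = \hat\beta - \beta^*$. Suppose (13) and (30) hold with certain $\{M_1, \eta^*\}$. Let
$F_*(\xi, S; \phi)$ be as in (32) with $S = \{j : \beta^*_j \neq 0\}$ and a constant $M_2$. Let $\eta \le 1 \wedge \eta^*$
and $\{\lambda, \lambda_0, \lambda_1\}$ satisfy

$$|w_S|_\infty\lambda + \lambda_0 \le \min\big\{ \xi(\lambda - \lambda_1),\ \eta e^{-\eta}F_*(\xi, S; M_2|\cdot|_2) \big\}. \tag{38}$$

Then, in the event $\Omega_0 \cap \{\max_{k=0,1}(z_k^*/\lambda_k) \le 1\}$ with the $z_k^*$ in (5) and $\Omega_0$ in (6),

$$\Delta(\beta^* + h, \beta^*) \le \frac{e^{\eta}(|w_S|_\infty\lambda + \lambda_0)^2|S|}{F_*(\xi, S; \phi_{1,S})}, \qquad \phi(h) \le \frac{e^{\eta}(|w_S|_\infty\lambda + \lambda_0)}{F_*(\xi, S; \phi)} \tag{39}$$

for all seminorms $\phi$. Moreover, if either (36) or (37) holds for the $\{\lambda_0, \lambda_1\}$ and $W$
is deterministic, then

$$P_{\beta^*}\big\{ \text{(39) holds for all seminorms } \phi \big\} \ge P_{\beta^*}(\Omega_0) - \epsilon_0.$$

(ii) If $\eta^* = \infty$ and (38) holds with $F_*(\xi, S; M_2|\cdot|_2)$ replaced by the $F_*(\xi, S)$ in (34),
then the conclusions of part (i) hold with $F_*(\xi, S; \cdot)$ replaced by the $F_0(\xi, S; \cdot)$ in (22).
Moreover, (39) can be strengthened with the lower bound $\Delta(\beta^* + h, \beta^*) \ge e^{-\eta}\langle h, \Sigma^*h \rangle$.
(iii) For any $\eta^* > 0$, the conclusions of part (ii) hold if $F_*(\xi, S)$ is replaced by
$\kappa_*^2(\xi, S)/(M_3|S|)$ in (38) with the $M_3$ in (35).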

Remark 3.2 Since $\phi = \phi_0$ is allowed in (39), (38) implies $\phi_0(h) \le \eta$ with $\phi_0(h) = M_2|h|_2$
in part (i) and the $\phi_0$ in (33) in part (ii). Similarly, under the conditions of
Theorem 2 (iii), $M_3|h_S|_1 \le \eta \le \eta^*$, so that (35) holds with $b = h = \hat\beta - \beta^*$.

Remark 3.3 If either (36) or (37) holds for $\{\lambda_0, \lambda_1\}$ and $W$ is deterministic, then
(38) implies $P_{\beta^*}\{\text{(18) holds}\} \ge P_{\beta^*}(\Omega_0) - \epsilon_0$.

Remark 3.4 Suppose $\{\min_{j \notin S} w_j, \min_j \Sigma^*_{jj}\}$ are bounded away from zero,
$\{\max_{j \in S} w_j, M_1\max_j |x_j|_\infty\}$ are bounded, and $\{1 + F_*^{-2}(\xi, S)\}(\log p)/n \to 0$. Then,
(36) holds with $\lambda_0 = \lambda_1 = a\sigma\sqrt{(2/n)\log(p/\epsilon_0)}$ for certain $a \le (1 + o(1))\max_j (\Sigma^*_{jj})^{1/2}/w_j$,
due to $\max\{\lambda_0, \eta, \eta_0\} \to 0+$. Again, the conditions and conclusions of Theorem 2
"converge" to those for the linear regression as if the Gram matrix is $\Sigma^*$.

Remark 3.5 In Theorem 2, the key condition (38) is weaker in parts (i) and (ii) than
part (iii), although part (ii) requires $\eta^* = \infty$. For $\Sigma = \Sigma^*$ and $M_1 = M_2 \le M_3/(1 + \xi)$,

$$\kappa_*^2(\xi, S)/(M_3|S|) \le \min\big\{ F_*(\xi, S),\ F_*(\xi, S; M_2|\cdot|_2) \big\},$$

since $n^{-1}\sum_{i=1}^n \ddot\psi_0(x^i\beta^*)|x^ib|^3/\langle b, \Sigma^*b \rangle \le |Xb|_\infty \le |b_S|_1M_3/M_1$ as in the derivation of
(35) and $|b|_2 \le (1 + \xi)|b_S|_1$ in the cone (9). For the more familiar $\kappa_*^2(\xi, S)/(M_3|S|)$,
(38) essentially requires a small $|S|\sqrt{(\log p)/n}$. The sharper Theorem 2 (i) and (ii)
provide conditions to relax the requirement to a small $|S|(\log p)/n$.

Remark 3.6 For $\hat w_j = 1$, [NRWY10] considered M-estimators under a restricted
strong convexity condition. For the GLM, they considered iid sub-Gaussian $x^i$ and
used empirical process theory to bound $\Delta(\beta^* + b, \beta^*)/\{|b|_2(|b|_2 - c_0|b|_1)\}$ from below
over the cone (9) with a small $c_0$. Their result extends the $\ell_2$ error bound
$|S|^{1/2}(\lambda + z_0^*)/RE_2^2(\xi, S)$ of [BRT09], while Theorem 2 extends the sharper (27) with the factor
$F_0(\xi, S; \phi_2)$. Theorem 2 applies to both deterministic and random designs. Similar
to [NRWY10], for iid sub-Gaussian $x^i$, empirical process theory can be used to verify
(38) with $F_*(\xi, S; M_2|\cdot|_2) \gtrsim |S|^{-1/2}$, provided that $|S|(\log p)/n$ is small.

Example 3.2 (Linear regression: oracle inequalities, continuation) For the 

linear regression model ( flQj) with quadratic loss, iI)q{0) = 9 2 /2, so that UJD) holds 
with Mi = and rf = 00. It follows that F*(£, S; M 2 \ ■ \ 2 ) = 00 and (33) has the 
interpretation with rj = 0+ and ne _,? F*(£, S; M 2 \ ■ I2) = 00. Moreover, since Mi = 0, 
r]o = 0+ in k3~D) . Thus, the conditions and conclusions of Theorem® "converge" to 
the case of linear regression as Mi — > 0+. Suppose £j ~ N(0,a 2 ) as in For 
Wj = Wj = 1 and = Ym=i x %l n = 1> (ESP holds with A = Ai = a^J {2/n) log(p/e ) 
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and fifty) holds with A = Ao(l +£)/(! ~ 0- The value of a can be estimated iteratively 
using the mean residual squares [SBvdG10, SZ11]. Alternatively, cross-validation can
be used to pick A. For <f)(b) = (f) 2 (b) = \b\ 2 /\S\ 1/2 , (TJS|> matches ftFty with the factor 
F (S t S;<h). 

Example 3.3 (Logistic regression: oracle inequalities) The model and loss 
function are given in < f77]j and [Wi) respectively. Here we verify the conditions of 
Theorem^ Condition fifty) holds with Mi = 1 and 77* = 00; Since ipo{t) = log(l + e*) ; 

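$$\frac{\ddot\psi_0(\theta + t)}{\ddot\psi_0(\theta)} = \frac{e^{t}(1 + e^{\theta})^2}{(1 + e^{\theta + t})^2} \ge \begin{cases} e^{-|t|}, & t < 0, \\ e^{-t}(1 + e^{\theta})^2/(e^{-t} + e^{\theta})^2 \ge e^{-|t|}, & t > 0. \end{cases}$$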

Since max t ijj(t) = c = 1/4 we can apply (Sty . In particular, if wj — Wj — 1 — 
kilIM A = {(£ + l)/(£ - l)} v / (log(p/eo))/(2n) and A{2£/(£ + 1)}/F.($, S) < V e^, 
then fifty) holds with at least probability 1 — eo under P^*. For such W and X, an 
adaptive choice of the penalty level is A = ay/ (2/n) logp with a 2 = Y17=t ^(/^{l ~ 
7Tj(/3)}/n ; where vrj(/3) is as in Example \2.Sl 
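
The two penalty levels in this example are simple closed-form expressions, and a small numerical sketch can make their relative sizes concrete. The function names below are hypothetical and not part of the paper; the first implements $\lambda = \{(\xi+1)/(\xi-1)\}\sqrt{\log(p/\epsilon_0)/(2n)}$ and the second the adaptive choice $\lambda = \hat\sigma\sqrt{(2/n)\log p}$ with $\hat\sigma^2$ the average of $\pi_i(1-\pi_i)$ over fitted probabilities, as read from the example above.

```python
import math


def logistic_penalty_level(n, p, xi, eps0):
    """Fixed level: {(xi + 1)/(xi - 1)} * sqrt(log(p/eps0) / (2n)), for xi > 1."""
    if xi <= 1:
        raise ValueError("xi must exceed 1")
    return (xi + 1.0) / (xi - 1.0) * math.sqrt(math.log(p / eps0) / (2.0 * n))


def adaptive_penalty_level(fitted_probs, p):
    """Adaptive level: sigma_hat * sqrt((2/n) log p),
    where sigma_hat^2 is the mean of pi_i * (1 - pi_i)."""
    n = len(fitted_probs)
    sigma2 = sum(q * (1.0 - q) for q in fitted_probs) / n
    return math.sqrt(sigma2) * math.sqrt(2.0 / n * math.log(p))
```

Since $\pi(1-\pi) \le 1/4$, the adaptive level never exceeds $\sqrt{(\log p)/(2n)}$, which it attains when every fitted probability equals $1/2$.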

Example 3.4 (Log-linear models: oracle inequalities) Consider counting data
with $y_i \in \{0, 1, 2, \ldots\}$. In log-linear models, it is assumed that

$$E_\beta(y_i) = e^{\theta_i}, \quad \theta_i = x_i\beta, \quad 1 \le i \le n. \tag{40}$$

The average negative Poisson log-likelihood function is

$$\ell(\beta) = \psi(\beta) - z'\beta, \quad \psi(\beta) = \frac{1}{n}\sum_{i=1}^n \{\exp(x_i\beta) + \log(y_i!)\}, \quad z = X'y/n. \tag{41}$$

Again this is a GLM. In this model, ipo(t) = e l , so that fifty) holds with Mi = 1 and 
rj* = 00. Although fif7\ ) is not useful with Co = oo ; fifty) can be used in Theorem^ 



4 Adaptive and multistage methods 

We consider in this section an adaptive Lasso and its repeated applications, with
weights recursively generated from a concave penalty function. This approach
appears to provide the most appealing choice of weights from both heuristic and
theoretical standpoints. The analysis here is based on the results in Section 3 and
the main idea in [Zha10b].

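Let $\rho_\lambda(t)$ be a penalty function with $\dot\rho_\lambda(0+) = \lambda$, where $\dot\rho_\lambda(t) = (d/dt)\rho_\lambda(t)$.
Define

$$\kappa = \sup_{0 < t_1 < t_2} \frac{|\dot\rho_\lambda(t_2) - \dot\rho_\lambda(t_1)|}{t_2 - t_1}. \tag{42}$$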
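
To make the constant $\kappa$ concrete, the sketch below evaluates it for two standard concave penalties, written here in their usual textbook forms (an assumption, since this section does not spell them out): the minimax concave penalty with $\dot\rho_\lambda(t) = (\lambda - t/\gamma)_+$ and the smoothly clipped absolute deviation penalty with $\dot\rho_\lambda(t) = \lambda$ for $t \le \lambda$ and $(a\lambda - t)_+/(a-1)$ for $t > \lambda$. The helper names are hypothetical, and the grid-based slope is only a lower approximation to the supremum in general.

```python
import numpy as np


def mcp_derivative(t, lam, gamma):
    """MCP: derivative (lam - t/gamma)_+ for t >= 0."""
    return np.maximum(lam - np.asarray(t, dtype=float) / gamma, 0.0)


def scad_derivative(t, lam, a=3.7):
    """SCAD: lam for t <= lam, (a*lam - t)_+ / (a - 1) for t > lam."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))


def kappa_on_grid(deriv, t_max, num=2001):
    """Largest absolute slope of `deriv` between adjacent grid points on (0, t_max]."""
    t = np.linspace(t_max / num, t_max, num)
    d = deriv(t)
    return float(np.max(np.abs(np.diff(d)) / np.diff(t)))
```

For these piecewise-linear derivatives the grid value agrees with the exact constants $\kappa = 1/\gamma$ for the MCP and $\kappa = 1/(a-1)$ for the SCAD, and both derivatives equal $\lambda$ at $0+$.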

Let $\Sigma^*$ be as in (21) and $\mathscr{C}(\xi, S)$ be the cone of Lemma 1. Define

$$F_2(\xi, S) = \inf\Big\{\frac{b'\Sigma^* b}{|b_S|_2\,|b|_2} : 0 \neq b \in \mathscr{C}(\xi, S)\Big\}. \tag{43}$$

The quantity $F_2(\xi, S)$ is slightly larger than the square of the restricted eigenvalue
for a design matrix $X$ when $\Sigma^* = X'X/n$. Given $0 < \epsilon_0 < 1$, the components of
the error vector $z - \dot\psi(\beta^*)$ are sub-Gaussian if for all $0 < t \le \sigma\sqrt{(2/n)\log(4p/\epsilon_0)}$,

$$P_{\beta^*}\{|(z - \dot\psi(\beta^*))_j| \ge t\} \le 2e^{-nt^2/(2\sigma^2)}. \tag{44}$$

This condition holds for all GLM when the components of $X\beta^*$ are uniformly in the
interior of the natural parameter space for the exponential family.

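Theorem 3 Suppose (21) holds. Let $\kappa$ be as in (42), $S_0 = \{j : \beta_j^* \neq 0\}$, $\lambda_0 > 0$,
$0 < \eta \le 1$, $0 < \gamma_0 < 1/\kappa$, $A > 1$, and $\xi \ge (A + 1)/(A - 1)$. Suppose

$$\lambda_0\{1 + A/(1 - \kappa\gamma_0)\} \le F_0(\xi, S; \phi_0)\,\eta e^{-\eta}, \quad F_* \le F_2(\xi, S), \tag{45}$$

for all $S \supseteq S_0$ with $|S \setminus S_0| \le \ell^*$, where $F_0(\xi, S; \phi_0)$ is the simple GIF and $F_2(\xi, S)$ is as in
(43). Let $\tilde\beta$ be an initial estimator of $\beta$ and $\hat\beta$ be as in (2) with $\hat w_j = \dot\rho_\lambda(|\tilde\beta_j|)/\lambda$ and
$\lambda = A\lambda_0/(1 - \kappa\gamma_0)$. Then,

$$|\hat\beta - \beta^*|_2 \le \frac{e^{\eta}}{F_*}\Big\{|\dot\rho_\lambda(|\beta^*_{S_0}|)|_2 + |\{z - \dot\psi(\beta^*)\}_{S_0}|_2 + \Big(\kappa + \frac{1}{\gamma_0 A} - \frac{\kappa}{A}\Big)|\tilde\beta - \beta^*|_2\Big\}$$

in the event $\{|(\tilde\beta - \beta^*)_{S_0^c}|_2^2 \le \gamma_0^2\lambda^2\ell^*\} \cap \{|z - \dot\psi(\beta^*)|_\infty \le \lambda_0\}$. Moreover, if (44) holds
and $\lambda_0 = \sigma\sqrt{(2/n)\log(2p/\epsilon_0)}$ with $0 < \epsilon_0 < 1$, then $P_{\beta^*}\{|z - \dot\psi(\beta^*)|_\infty > \lambda_0\} \le \epsilon_0$.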
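
The probability statement of Theorem 3 is a union bound over the $p$ coordinates of $z - \dot\psi(\beta^*)$ under the tail condition (44). The short sketch below (hypothetical helper names, arbitrary example values in the usage) checks numerically that the stated $\lambda_0$ makes the union bound equal to $\epsilon_0$.

```python
import math


def lambda0(sigma, n, p, eps0):
    """lambda_0 = sigma * sqrt((2/n) * log(2p / eps0))."""
    return sigma * math.sqrt(2.0 / n * math.log(2.0 * p / eps0))


def union_bound(sigma, n, p, lam0):
    """p coordinates, each with tail at most 2 * exp(-n * lam0^2 / (2 sigma^2))."""
    return 2.0 * p * math.exp(-n * lam0 ** 2 / (2.0 * sigma ** 2))
```

With $n\lambda_0^2/(2\sigma^2) = \log(2p/\epsilon_0)$, the bound is $2p \cdot \epsilon_0/(2p) = \epsilon_0$; this $\lambda_0$ also lies inside the range of $t$ allowed in (44).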

Theorem 3 raises the possibility that $\hat\beta$ improves $\tilde\beta$ under proper conditions. Thus
it is desirable to repeatedly apply this adaptive Lasso in the following way,

$$\hat\beta^{(k+1)} = \arg\min_{\beta}\Big\{\ell(\beta) + \sum_{j=1}^p \dot\rho_\lambda(|\hat\beta_j^{(k)}|)\,|\beta_j|\Big\}, \quad k = 0, 1, \ldots. \tag{46}$$

Such multistage algorithms have been considered in [FL01, ZL08, Zha10b]. As
discussed in Remark 4.1 below, it is beneficial to use a concave penalty $\rho_\lambda$ in (46).
Natural choices of $\rho_\lambda$ include the smoothly clipped absolute deviation and minimax
concave penalties [FL01, Zha10a].
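
The recursion (46) can be sketched for the quadratic loss of linear regression as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weighted Lasso is solved by plain coordinate descent, the weights come from the minimax concave penalty derivative $(\lambda - |t|/\gamma)_+$ in its standard form, and all function names, the value of `gamma`, and the number of sweeps are hypothetical choices.

```python
import numpy as np


def soft_threshold(x, t):
    """Scalar soft-thresholding operator."""
    return np.sign(x) * max(abs(x) - t, 0.0)


def weighted_lasso(X, y, pen, n_sweeps=100):
    """Coordinate descent for (1/2n)|y - X b|^2 + sum_j pen[j] * |b_j|."""
    n, p = X.shape
    col_sq = (X ** 2).sum(axis=0) / n
    b = np.zeros(p)
    r = np.asarray(y, dtype=float).copy()  # residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of column j with the partial residual
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            new = soft_threshold(rho, pen[j]) / col_sq[j]
            if new != b[j]:
                r -= X[:, j] * (new - b[j])
                b[j] = new
    return b


def mcp_derivative(t, lam, gamma):
    """Derivative of the minimax concave penalty at |t|."""
    return np.maximum(lam - np.abs(t) / gamma, 0.0)


def multistage_lasso(X, y, lam, gamma=3.0, n_stages=3):
    """Stage 0 is the unweighted Lasso; each later stage reweights as in (46)."""
    p = X.shape[1]
    beta = weighted_lasso(X, y, np.full(p, lam))
    for _ in range(n_stages):
        beta = weighted_lasso(X, y, mcp_derivative(beta, lam, gamma))
    return beta
```

In an orthonormal design with $X'X/n = I$ each stage reduces to coordinatewise soft thresholding of $X'y/n$, so the shrinkage applied to a large coefficient decreases from stage to stage while small coefficients remain at zero.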

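Theorem 4 Let $\{\kappa, S_0, \lambda_0, \eta, \gamma_0, A, \xi, \ell^*, \lambda\}$ be the same as Theorem 3. Let $\hat\beta^{(0)}$ be
the unweighted Lasso with $\hat w_j = 1$ in (2) and $\hat\beta^{(\ell)}$ be the $\ell$-th iteration of the recursion
(46) initialized with $\hat\beta^{(0)}$. Let $F_0(\xi, S_0; \phi_2)$ be the simple GIF with $\phi_2(h) = |h|_2/|S|^{1/2}$.
Suppose (45) holds and

$$e^{\eta}\{1 + (1 - \kappa\gamma_0)/A\}\sqrt{|S_0|}/F_0(\xi, S_0; \phi_2) \le \gamma_0\sqrt{\ell^*}. \tag{47}$$

Define $r_0 = (e^{\eta}/F_*)\{\kappa + 1/(\gamma_0 A) - \kappa/A\}$. Suppose $r_0 < 1$. Then,

$$|\hat\beta^{(\ell)} - \beta^*|_2 \le \frac{|\dot\rho_\lambda(|\beta^*_{S_0}|)|_2 + |\{z - \dot\psi(\beta^*)\}_{S_0}|_2}{e^{-\eta}F_*(1 - r_0)/(1 - r_0^{\ell})} + \frac{r_0^{\ell}\,e^{\eta}\lambda\{1 + (1 - \kappa\gamma_0)/A\}}{F_0(\xi, S_0; \phi_2)/|S_0|^{1/2}} \tag{48}$$

in the event

$$\{|z - \dot\psi(\beta^*)|_\infty \le \lambda_0\} \cap \Big\{\frac{|\dot\rho_\lambda(|\beta^*_{S_0}|)|_2 + |\{z - \dot\psi(\beta^*)\}_{S_0}|_2}{e^{-\eta}F_*(1 - r_0)} \le \gamma_0\lambda\sqrt{\ell^*}\Big\}. \tag{49}$$

Moreover, if (44) holds and $\lambda_0 = \sigma\sqrt{(2/n)\log(4p/\epsilon_0)}$ with $0 < \epsilon_0 < 1$, then the
intersection of the events (49) and $\{|\{z - \dot\psi(\beta^*)\}_{S_0}|_2 \le n^{-1/2}\sigma\sqrt{2|S_0|\log(4|S_0|/\epsilon_0)}\}$
happens with at least $P_{\beta^*}$ probability $1 - \epsilon_0$, provided that

$$\frac{|\dot\rho_\lambda(|\beta^*_{S_0}|)|_2 + n^{-1/2}\sigma\sqrt{2|S_0|\log(4|S_0|/\epsilon_0)}}{e^{-\eta}F_*(1 - r_0)} \le \frac{\gamma_0 A\lambda_0\sqrt{\ell^*}}{1 - \kappa\gamma_0}. \tag{50}$$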

Remark 4.1 Define $R^{(0)} = e^{\eta}\lambda\{1 + (1 - \kappa\gamma_0)/A\}|S_0|^{1/2}/F_0(\xi, S_0; \phi_2)$ and

$$R^{(\infty)} = \frac{|\dot\rho_\lambda(|\beta^*_{S_0}|)|_2 + |\{z - \dot\psi(\beta^*)\}_{S_0}|_2}{e^{-\eta}F_*(1 - r_0)}, \quad R^{(\ell)} = (1 - r_0^{\ell})R^{(\infty)} + r_0^{\ell}R^{(0)},$$

as in the right-hand side of (48). Theorem 4 asserts that $|\hat\beta^{(\ell)} - \beta^*|_2 \le 2R^{(\infty)}$ after
$\ell = |\log r_0|^{-1}\log(R^{(0)}/R^{(\infty)})$ iterations of the recursion (46). Under condition (44),

$$E_{\beta^*}R^{(\infty)} \le \{|\dot\rho_\lambda(|\beta^*_{S_0}|)|_2 + 2\sigma\sqrt{|S_0|/n}\}\,e^{\eta}/\{F_*(1 - r_0)\}.$$

Suppose $\rho_\lambda(t)$ is concave in $t$. Then $|\dot\rho_\lambda(|\beta^*_{S_0}|)|_2 \le \dot\rho_\lambda(0+)|S_0|^{1/2} = \lambda|S_0|^{1/2}$. This
component of $E_{\beta^*}R^{(\infty)}$ matches the noise inflation due to model selection since
$\lambda \asymp \lambda_0 = \sigma\sqrt{(2/n)\log(p/\epsilon_0)}$. This noise inflation diminishes when $\min_{j \in S_0}|\beta_j^*| \ge \gamma\lambda$
and $\dot\rho_\lambda(t) = 0$ for $|t| \ge \gamma\lambda$, yielding the super-efficient error bound
$E_{\beta^*}R^{(\infty)} \le \{2\sigma\sqrt{|S_0|/n}\}\,e^{\eta}/\{F_*(1 - r_0)\}$. This risk bound is comparable with those for
concave penalized least squares in linear regression [Zha10a].
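
The geometric contraction described in Remark 4.1 can be checked with a few lines of arithmetic. The sketch below uses hypothetical names and arbitrary example values; it computes $R^{(\ell)} = (1 - r_0^{\ell})R^{(\infty)} + r_0^{\ell}R^{(0)}$ and the number of stages after which $r_0^{\ell}R^{(0)} \le R^{(\infty)}$, so that the bound is at most $2R^{(\infty)}$.

```python
import math


def bound_after(ell, r0, R0, Rinf):
    """R^(ell) = (1 - r0^ell) * R^(inf) + r0^ell * R^(0)."""
    return (1.0 - r0 ** ell) * Rinf + r0 ** ell * R0


def stages_needed(r0, R0, Rinf):
    """Smallest integer ell >= log(R0/Rinf) / |log r0| (zero if R0 <= Rinf)."""
    if R0 <= Rinf:
        return 0
    return math.ceil(math.log(R0 / Rinf) / abs(math.log(r0)))
```

The sequence also satisfies the one-step recursion $R^{(\ell)} = (1 - r_0)R^{(\infty)} + r_0R^{(\ell-1)}$, which is the form that arises when Theorem 3 is applied stage by stage.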

Remark 4.2 For $\log(p/n) \asymp \log p$, the penalty level $\lambda$ in Theorems 3 and 4 is
comparable with the best proven results and of the smallest possible order in linear
regression. For $\log(p/n) \ll \log p$, the proper penalty level is expected to be of the
order $\sigma\sqrt{(2/n)\log(p/|S_0|)}$ under a vectorized sub-Gaussian condition which is slightly
stronger than (44). This refinement for smaller $p$ is beyond the scope of this paper.

Remark 4.3 The constant factors used in Theorems 3 and 4 provide conditions of
slightly weaker form than those based on sparse eigenvalues, although they typically do
not imply each other due to differences in the dimension of covered models and various
constant factors. If 4>o{b) = M 3 \bs\i can be used as in (E3J), then M 3 \S\F (C,, S; 0o) > 
-F 2 (£, S). In GLM, O = M2I&I2 can be used as in (CHj) to weaken this regularity 
condition. Since \bs\i < |£'| 1 / 2 |&s|2 and So C S, -F (£, So; ^2) > F (^, S; fa) > 

Remark 4.4 Although Theorem 3 is valid for the smaller $\xi \ge (A + 1 - \kappa\gamma_0)/(A - 1)$,
the proof of Theorem 4 requires $\xi \ge (A + 1)/(A - 1)$.



5 Selection consistency 

In this section, we provide a selection consistency theorem for the $\ell_1$ penalized convex
minimization estimator, including both the weighted and unweighted cases. Let
$\|M\|_\infty = \max_{|u|_\infty \le 1}|Mu|_\infty$ for matrices $M$.

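Theorem 5 Let $\hat\beta$ be as in (2), $\beta^*$ be a target vector, $z_k^*$ and $\Omega_0$ be as in Lemma 1,
$S = \{j : \beta_j^* \neq 0\}$ and $F(\xi, S; \phi_0, \phi_0)$ be the GIF in (17).
(i) Let $0 < \eta \le \eta^* \le 1$ and $\mathscr{B}_0^* = \{\beta : \phi_0(\beta - \beta^*) \le \eta\}$. Suppose

$$\sup_{\beta \in \mathscr{B}_0^*}\|W_{S^c}^{-1}\ddot\psi_{S^c,S}(\beta)\{\ddot\psi_S(\beta)\}^{-1}W_S\|_\infty \le \kappa_0 < 1, \tag{51}$$

$$\sup_{\beta \in \mathscr{B}_0^*}\|W_{S^c}^{-1}\ddot\psi_{S^c,S}(\beta)\{\ddot\psi_S(\beta)\}^{-1}\|_\infty \le \kappa_1. \tag{52}$$

Then, $\{j : \hat\beta_j \neq 0\} \subseteq S$ in the event

$$\Omega_1^* = \Omega_0 \cap \{|w_S|_\infty\lambda + z_0^* \le \eta e^{-\eta}F(0, S; \phi_0, \phi_0),\ \kappa_1 z_0^* + z_1^* \le (1 - \kappa_0)\lambda\}. \tag{53}$$

(ii) Let $0 < \eta \le \eta^* \le 1$ and $\mathscr{B}_0 = \{\beta : \phi_0(\beta - \beta^*) \le \eta,\ \mathrm{sgn}(\beta) = \mathrm{sgn}(\beta^*)\}$. Suppose
(51) and (52) hold with $\mathscr{B}_0^*$ replaced by $\mathscr{B}_0$ and

$$\sup_{\beta \in \mathscr{B}_0}\|\{\ddot\psi_S(\beta)\}^{-1}\|_\infty \le M_0. \tag{54}$$

Then, $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*)$ in the event

$$\Omega_1^* \cap \Big\{|w_S|_\infty\lambda + z_0^* < M_0^{-1}\min_{j \in S}|\beta_j^*|\Big\}. \tag{55}$$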

(iii) Suppose conditions of Theorem 2 hold for the GLM. Then, the conclusions
of (i) and (ii) hold under the respective conditions if $F(0, S; \phi_0, \phi_0)$ is replaced by
F*(£, S; M 2 | • | 2 ) or , 5) or S)/(M 3 \S\) with the respective </> in Theorem^ 

For $\hat w_j = 1$, this result is somewhat more specific in the radius $\eta$ for the uniform
irrepresentable condition (51), compared with a similar extension of the selection
consistency theory to the graphical Lasso by [RWRY08]. In linear regression (10),
$\ddot\psi(\beta) = \Sigma = X'X/n$ does not depend on $\beta$, so that Theorem 5 with the special
$\hat w_j = 1$ matches the existing selection consistency theory for the unweighted Lasso
[MB06, Tro06, ZY06, Wai09]. We discuss below the $\ell_1$ penalized logistic regression
as a specific example.
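
For linear regression with unit weights, the constant in (51) reduces to the sup-norm operator norm of $\Sigma_{S^c,S}\Sigma_S^{-1}$, which can be computed directly. The sketch below is an illustration with hypothetical function names and NumPy as an assumed dependency; it uses the fact that $\max_{|u|_\infty \le 1}|Mu|_\infty$ equals the largest absolute row sum of $M$, and it assumes $S$ is a proper subset of the columns.

```python
import numpy as np


def sup_operator_norm(M):
    """||M||_inf = max over |u|_inf <= 1 of |M u|_inf = largest absolute row sum."""
    return float(np.abs(M).sum(axis=1).max())


def irrepresentable_constant(X, S):
    """Norm of Sigma_{S^c,S} Sigma_S^{-1} for Sigma = X'X/n and unit weights."""
    n, p = X.shape
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(p), S)
    Sigma = X.T @ X / n
    M = Sigma[np.ix_(Sc, S)] @ np.linalg.inv(Sigma[np.ix_(S, S)])
    return sup_operator_norm(M)
```

A value strictly below one corresponds to $\kappa_0 < 1$ in (51) for this special case.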

Example 5.1 (Logistic regression: selection consistency) Suppose $\hat w_j = 1 = |x_j|_2^2/n$
where $x_j$ are the columns of $X$. If (53) and (55) hold with $z_0^*$ and $z_1^*$ replaced
by $\sqrt{(\log(p/\epsilon_0))/(2n)}$, then the respective conclusions of Theorem 5 hold with at least
probability $1 - \epsilon_0$ in $P_{\beta^*}$.
6 The sparsity of the Lasso and SRC



The results in Sections 2 and 3 are concerned with the estimation and prediction
properties of $\hat\beta$, but not dimension reduction. In this section, we provide an upper
bound for the dimension of $\hat\beta$. For this purpose, we need to strengthen (21) to

$$e^{-\phi_0(b)}\Sigma^* \le \ddot\psi(\beta^* + b) \le e^{\phi_0(b)}\Sigma^*, \quad \forall\, b \in \mathscr{C}(\xi, S),\ \phi_0(b) \le \eta^*. \tag{56}$$

We assume the following sparse Riesz condition, or SRC [ZH08]:

$$c_* \le u'\ddot\psi_A(\beta^*)u \le c^*, \qquad \frac{|S|}{2(1 - \alpha)}\Big(\frac{e^{2\eta}c^*}{c_*} + 1 - 2\alpha\Big) \le d^*, \tag{57}$$

for certain constants $\{c_*, c^*\}$, integer $d^*$, $0 < \alpha < 1$, $0 < \eta \le \eta^* \le 1$, all $A \supseteq S$ with
$|A| = d^*$ and all $u \in \mathbb{R}^A$ with $|u|_2 = 1$. The following theorem is an extension of the
dimension bounds in [Zha10a] from linear regression.

Theorem 6 Let (3* and S be as in Theorem^ Consider the (3 defined in with 
Wj = 1 for all j . Suppose (56]) and |57| ) hold. Then, 



\s\ 



e 2r, c* 



- 1 



.2(1 - a) V c* 
in the event Qi is defined in [W\l , provided that 

\{^y A ll2 Un\2 < e^aAV^i - \S\)/c*. 



max 

ADS,\A\<d! 



For GLM, the results on the dimension bounds of the Lasso can be slightly 
simplified. Let X e = (£ - l)A/(£ + 1). Suppose d56j) and ([57]) hold and (A + A 5 ) < 
M^e-^F^O, S) with < rj < 1. Then, 



in the event {z* < A^}. The probability of the event {z* < A^} can be calculated 
using Lemma [2] as in the previous sections. 
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1-2(1 




a) 
7 Discussion



In this paper, we studied the estimation, prediction, selection and sparsity properties
of the weighted $\ell_1$-penalized estimators in a general convex loss formulation.

We applied our general results to several important statistical models, including
linear regression and generalized linear models. For linear regression, we extend
the existing results to the weighted/adaptive Lasso. For the GLMs, $\ell_q$ error
bounds for a general $q \ge 1$ are not available in the literature, although
$\ell_1$ and $\ell_2$ bounds have been obtained under different sets of conditions
in [vdG08] and [NRWY10], respectively. Our fixed-sample analysis provides explicit
constant factors in an explicit neighborhood of a target. Our oracle inequalities yield
even sharper results for multistage recursive application of an adaptive Lasso.

An interesting aspect of the approach taken in this paper in dealing with general 
convex losses such as those for the GLM is that the conditions imposed on the Hessian 
naturally 'converge' to those for the linear regression as the convex loss 'converges' 
to a quadratic form. 

A key quantity used in the derivation of the results is the generalized invertibility
factor (17), which grows out of the idea of the $\ell_2$ restricted eigenvalue but improves
upon it. The use of GIF yields sharper bounds on the estimation and prediction errors.
This was discussed in detail in the context of linear regression in [vdGB09, YZ10].

We assume that the convex function $\psi(\cdot)$ is twice differentiable. Although this
assumption is satisfied in many important and widely used statistical models, it would
be interesting to extend the results obtained in this paper to models with less smooth
loss functions, such as those in quantile regression and support vector machines.

8 Appendix 

Proof of Lemma [TJ. Since ip(/3) — ip(/3*) = z — ip(/3*) — g, ([3]) implies 

A0, /?*) = 0,z- ip(P*)) - \\W% - </T, z - - 9) 

and \gj\ < WjX. Thus, © follows from \{z — < Wj\ and Wj < Wj in 5" in Q . 

For (IE]), we have hs? = /?s c an d 13*$^ = 0, so that in f2 (EJ) gives 

A0,f3*) = (% c ,{ z -^(3*)} sc )-\\W S cM 1 -(h s ,{z-iP((3*)-g}s) 
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< \W S cMi(4 - A) + (h s , gs-{z- j>(P*)}s) 

< \W S cPs4i(z*i ~ A) + \h s \x(z* + KM). 

This gives ©. Since A(/3,/3*) > 0, h E VfaS) when (\w s \ooX + z* a )/{\ - z{) < f. 
For j # S, hStf + h)- = %{z - j>(P*) - g)j < \Pj\(w^ ~ 9j) < 0. □ 

Proof of Theorem [H Let h = (3 — (3* . Since i/j(/3) is a convex function, 

t^Atf* + th, 0*) = ^{ W + th) - t(h, ^(/T)) } 

is an increasing function of t. For < t < 1 and in the event dHJ implies 
t^AiP* +th,p*) <A(h + P*,p*) < (K|ooA + «S)IMi- 

By ® and (ED, F(£,S;0 o ,0 o ) < A(P*+th } P*)e^ th ) / {t\h s \i(j) Q {th)} for O (^) < 
Thus, for <fro{th) < mm{r]*, <f) (h)} and in the event 



o (^)e" 0o( * ft) < 



A(/3* + tfr, /3*) 



< M^±4_ <r]e - 



If r/* < (po(h), the above inequality at (/>o(th) = if would give rfe - ^ < rje -11 , which 
contradicts to rj < 77* < 1. Thus, 77* > 0o(^) an d 0o(^)e~* ^ < ^e _,? for all 
< t < 1. This implies </>o(^) < 77 < 77* ■ Another application of (JH]) yields 

^ j " F(e,^;0 o ,0)|^|i " F(£,S;0 o ,0) • 
We obtain ( 1201 by applying ( TT9l with = 0i f s to the right-hand side of (jHJ). □ 
Proof of Lemma El (i) Since ip{P) = YJU x i Mx i P)/n by (EHJ), 



i=l 



= exp 



i=i 



^2 



(1 - t)dt . (58) 



This and (SO} imply that for M^Xfe^ < r/ , 



E^exp{-^/(z-^(/3*))} <exp 



(59) 
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Since maxfc =0) i z* k jXk = max,- t - 1 \zj — ipj(j3*)\ by (0), 

p 

P^{max4/A fe >l} < Y,Pp-{\z j -i> j (P*)\>t j } 

j'=i 
v 

< 22 E P* ex P { ~^ h o I z i - i>i (Z 3 *) I - ) 

with 6j = e~ m tjjYT^. Since Mi max,.,- |iCij|6j < 770, (1591) gives 

r 1 p / ne~ m t 2 \ 

P s .{max-/A t >l}<^2ex P (-^). 

(ii) If (157]) holds, we simply replace ^ (V(/3 + tb)) by c in (158]) . The rest is simpler 
and omitted. □ 

Proof of Theorem El (i) Since 5; 0) in (135ft is a lower bound of F(£, S; O , <P) 
in (ITTj) . ( 139]) follows from Theorem [1] with 4>o(b) = M 2 \b\ 2 . The probability statement 
follows from Lemma [2] (ii) Since ( 121]) holds for the (po(b) in (133]) . we are allowed to 
use S) = -F (£, S; 0o) in Corollary [H The condition 77* = 00 is used since </>o(6) 
does not control M\\Xb\vo. (hi) We are also allowed to use <po(b) = M 3 \bs\i in ( 135]) 
due to M^Xbl^ < <p Q {b). □ 

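Proof of Theorem 3. Let $h = \hat\beta - \beta^*$, $w_j = \hat w_j$ and $S = \{j : |\tilde\beta_j| > \gamma_0\lambda\} \cup S_0$.
For $j \notin S$, $\hat w_j\lambda = \dot\rho_\lambda(|\tilde\beta_j|) \ge \dot\rho_\lambda(0+) - \kappa\gamma_0\lambda = (1 - \kappa\gamma_0)\lambda$, so that
$z_1^* = |\hat W_{S^c}^{-1}\{z - \dot\psi(\beta^*)\}_{S^c}|_\infty \le \lambda_0/(1 - \kappa\gamma_0) = \lambda/A$. We also have $z_0^* \le (1 - \kappa\gamma_0)\lambda/A$.
Since $|w|_\infty \le 1$, these bounds for $z_0^*$ and $z_1^*$ yield

$$\frac{|w_S|_\infty\lambda + z_0^*}{\lambda - z_1^*} \le \frac{\lambda + (1 - \kappa\gamma_0)\lambda/A}{\lambda - \lambda/A} = \frac{A + 1 - \kappa\gamma_0}{A - 1}.$$

Thus, by Lemma 1,

$$h \in \mathscr{C}(\xi, S), \quad \Delta(\beta^* + h, \beta^*) \le |h_S|_2\big(|w_S|_2\lambda + |\{z - \dot\psi(\beta^*)\}_S|_2\big).$$

Since $|S \setminus S_0| \le |(\tilde\beta - \beta^*)_{S_0^c}|_2^2/(\gamma_0\lambda)^2 \le \ell^*$, we have

$$|w_S|_\infty\lambda + z_0^* \le \lambda + (1 - \kappa\gamma_0)\lambda/A \le F_0(\xi, S; \phi_0)\,\eta e^{-\eta}.$$

Thus, $\phi_0(h) \le \eta$ by Theorem 1. It follows that $\Delta(\hat\beta, \beta^*) \ge e^{-\eta}h'\Sigma^*h$ by (21), so that by (43),

$$e^{-\eta}F_*|h|_2 \le e^{-\eta}F_2(\xi, S)|h|_2 \le h'\Sigma^*h\,e^{-\eta}/|h_S|_2 \le \Delta(\beta^* + h, \beta^*)/|h_S|_2$$

when $|h_S|_2 \neq 0$. Consequently,

$$e^{-\eta}F_*|h|_2 \le |w_S|_2\lambda + |\{z - \dot\psi(\beta^*)\}_S|_2. \tag{60}$$

Since $\hat w_j\lambda = \dot\rho_\lambda(|\tilde\beta_j|) \le \dot\rho_\lambda(|\beta_j^*|) + \kappa|\tilde\beta_j - \beta_j^*|$, we have

$$|w_S|_2\lambda \le |\dot\rho_\lambda(|\beta^*_{S_0}|)|_2 + \kappa|\tilde\beta - \beta^*|_2.$$

Since $|z - \dot\psi(\beta^*)|_\infty \le \lambda_0 = (1 - \kappa\gamma_0)\lambda/A$,

$$|\{z - \dot\psi(\beta^*)\}_S|_2 \le |\{z - \dot\psi(\beta^*)\}_{S_0}|_2 + |S \setminus S_0|^{1/2}(1 - \kappa\gamma_0)\lambda/A \le |\{z - \dot\psi(\beta^*)\}_{S_0}|_2 + |\tilde\beta - \beta^*|_2(1 - \kappa\gamma_0)/(\gamma_0 A).$$

Inserting the above inequalities into (60), we find that

$$|\hat\beta - \beta^*|_2 \le \frac{e^{\eta}}{F_*}\Big\{|\dot\rho_\lambda(|\beta^*_{S_0}|)|_2 + |\{z - \dot\psi(\beta^*)\}_{S_0}|_2 + \Big(\kappa + \frac{1}{\gamma_0 A} - \frac{\kappa}{A}\Big)|\tilde\beta - \beta^*|_2\Big\}.$$

The probability statement follows directly from (44) with the union bound. □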

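Proof of Theorem 4. Let $R^{(\ell)}$ be as in Remark 4.1. For $|z - \dot\psi(\beta^*)|_\infty \le \lambda_0$,
Corollary 1 gives

$$|\hat\beta^{(0)} - \beta^*|_2 \le e^{\eta}(\lambda + \lambda_0)|S_0|^{1/2}/F_0(\xi, S_0; \phi_2) = R^{(0)}.$$

Under conditions (47) and (49), we have $R^{(\ell)} \le \gamma_0\lambda\sqrt{\ell^*}$ for all $\ell \ge 0$, since $R^{(\ell)}$ is a
convex combination of $R^{(0)}$ and $R^{(\infty)}$. We prove (48) by induction. We have already proved
(48) for $\ell = 0$. For $\ell \ge 1$, we let $\tilde\beta = \hat\beta^{(\ell-1)}$, so that $|\tilde\beta - \beta^*|_2 \le R^{(\ell-1)} \le \gamma_0\lambda\sqrt{\ell^*}$,
and apply Theorem 3: $|\hat\beta^{(\ell)} - \beta^*|_2 \le (1 - r_0)R^{(\infty)} + r_0R^{(\ell-1)} = R^{(\ell)}$. The probability
statement follows directly from (44) with the union bound. □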

Proof of Theorem 5. We first prove the more complicated part (ii). Let $\tilde z =
z - \dot\psi(\beta^*)$ and $\lambda$ be fixed. Consider

p 

0{X,t) = argmin{^(/3) - (f3,ip{/3*) +tz) +t\J2™jW ■ Ps° = o} (61) 

/3 j= i 
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as an artificial path for < t < 1. For each t, the KKT conditions for /3(A,t) are 



-■ tWj\Bgn(Pj(\, t)) V/^A, t) ^ 



where g(X,t) = -?p(f3(\,t)) +tz. Let h(X,t) = P(\,t) - (3*. Since h S c = 0, 

the proof of Theorem [T] for £ =0 yields 

0o(?(A,t)-/3*)<77, V0<*<1. (62) 

Since i>s(P*) is positive-definite, /3(A,0+) = It follows that sgn(/3 5 (A, £)) = 
sgn(/3£) for < t < t x for a certain < t\ < 1. An application of the differentiation 
operator D = (d/dt) to the KKT condition yields 

z 3 - $ jiS @(\,t)){(D$)(\,t)} s = %Asgn(/3*), \/ 3 eS,0<t< h. 

Thus, for < t < h 

(DP) S (\, t) = {j> s @{\, - AW^5 sgn(/3*)} (63) 

and with an application of the chain rule, 

Di 8 o0{\t)) = i S c >s 0(\t))(Dp)s(\,t) 

= ^ Ci5 (^(A,t)){^ 5 (^(A,t))}- 1 {^ - \W s sgn(f3*)}. (64) 

By ([62]), 0(X,t) e for < t < t v It follows from (15511 , (151) and (1331) that 
|(£>?)s(A, t)U < Mo|5s - AWssgnO^U < M (\w s \oo\ + z* ) < min - e x 

for < t < ti and some ei > 0. Thus, |/is(A, t)!^ < tM (\ws\oo^ + z o) < mm .jes 1/^*1 — 
ex. This implies sgn(/3(A, t— )) = sgn(/3*) for < £ < 1 by the continuity of 0(\,t) in 
t, i.e. ti = 1. Since |W<7 < \v s \oo for all v e H p in fi , O, (EI]) and (1321) 

implies that for < £ < 1 

IW^DM/SCA,*))! < |^ 5 - 1 ^,5(^(A,t)){^(^(A,t))}-%|oo 

+A|W^ S c, s (?(A, t)){^ s (?(A, t))}" 1 ^ sgn(^) U 

< + K X. 
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This implies \WgHs°(Pfa !))loo < WgHs^*)\oo < «i2g+zJ+«oA < A. 

It follows that 

4(J(A, 1-)) = ^Asgn(/3*), sgn(/3*) = sgn(?(A, 1-)), j G 5 
iMX,l-)) ewjX[-l,l], j 

These are the KKT conditions for (3(X, 1—) with sgn(/3(A, 1—)) = sgn(/3*). 

The proof for part (i) is similar, with $\mathrm{sgn}(\beta^*)$ replaced by $\mathrm{sgn}(\tilde\beta(\lambda, t))$ in the proof
of part (ii). Finally, in part (iii), $F(0, S; \phi_0, \phi_0)$ is simply replaced by its lower bounds
with the respective $\phi_0$. □

Proof of Theorem gfl Let A x = {j : \ gj \ = A} U S, A = A x \ S and S = ^(/?* + 
th)dt, where g is the negative gradient in ([3]) and h = (3 — f3*. Let = (gjl{j £ -A})'- 
Consider the case |Ai| < d* (e.g. with sufficiently small z*). Since g Ao h Ao = A|/i Ao |i 
by © and + h) - £(/3*)} Al = E Al V, 

Since |S^ /2 ^ (>lo) || + |S^ 1/2 ^ (Al) |2 = |E^ /2 ^ (5 )|| + 2# (A)) E^ (Al) , we have 

|S Al V Vo)l2 + l^'Voll < 1^(5)12 + 2\^/ 2 g {Ao 0~l /2 i Al (ni 
Thus, in the event \L~~ A \ /2 i M {P*)\ < aX^\A Q \/ (c*e^) with < a < 1, we have 

(1 - a)\t~ A \ /2 g {Ao) \l + l&^Vol 2 < |S^ (fl) |» + «A 2 |A |/(c^). 
Since the eigenvalues of E Al lie in the interval c*e~ v and c*e v and 5u — Xsgn(/3 Ao ), 



(l-a)X 2 \A \ , A 2 |A | + |^|2 aA 2 |A 



o 



This gives 

We note that \t~ A \ /2 Z M (f3*)\ < e^ 2 max ADSt \ A] < d . \(E*) A 1/2 £ A ((3*)\. We complete the 
proof by considering the artificial path (I6T1) . □ 